A Novel Low-Cost Real-Time Power Measurement Platform for LoWPAN IoT Devices

With the rapid development of technology and applications for the Internet of Things (IoT), Low-Power Wireless Personal Area Network (LoWPAN) devices are increasingly widely deployed. Evaluation of power efficiency is important for LoWPAN applications. The conventional method for evaluating the power efficiency of LoWPAN devices is to measure the current of the device in its working, idle, and sleep states and then estimate the battery lifetime from the averages; this approach relies heavily on the accuracy of the testing equipment, offers limited accuracy, and incurs high cost. In this work, a low-cost, real-time power measurement platform called PTone is proposed, which can detect the real-time power of LoWPAN devices (with accuracy above 99.63%) and determine the state of each module of the Device Under Test (DUT). Based on PTone, a novel abnormal-status diagnosis mechanism is proposed. The mechanism can not only detect abnormal status but also locate and classify it accurately. With this method, each state of the DUT during wireless transmission is estimated and different abnormal statuses can be classified, so the specific abnormal module can be located, which significantly shortens the development process for LoWPAN devices and thus reduces cost.

With the rapid development of connected embedded devices being placed everywhere in our everyday life, the vision of the Internet of Things (IoT) is coming close to reality, where billions of physical-world devices have a digital presence on the Internet. However, despite the importance of IoT, several technical challenges remain before the next Internet revolution. This special issue has focused on technical challenges that can enable IoT. We called for manuscripts presenting and discussing the most recent advances in embedded systems for IoT. By the deadline of the special issue, 17 manuscripts had been received worldwide. After the review process, 7 manuscripts were accepted for this special issue. The accepted research manuscripts focus on energy efficiency and power transfer for IoT devices, device-to-device communication, multithreading for multicore IoT platforms, and topology construction and scalability of IoT networks. The accepted manuscripts present important research findings, and these advances will contribute to the development of IoT.
Evaluation of power efficiency is important for low-power wireless personal area network (LoWPAN) devices and applications in IoT. Conventional methods for evaluating the power efficiency of LoWPAN devices rely deeply on the accuracy of the testing equipment, which incurs high cost while offering limited accuracy. To tackle this challenge, a low-cost, real-time power measurement platform called PTone is proposed, which can be used to detect the real-time power usage of LoWPAN devices and can determine the state of each module of the device under test. Based on PTone, an abnormal-status diagnosis mechanism is developed which can not only detect abnormal status but also classify the abnormal status and accurately locate the module causing the abnormality.
Wireless power transfer is another promising technology that can enable a massive number of wireless devices in IoT. For this purpose, destination-aided simultaneous wireless information and power transfer (SWIPT) is proposed for a decode-and-forward relay network, in which massive multiple-input multiple-output (MIMO) antennas are deployed at the relay to assist communications among multiple source-destination pairs. During relaying, energy signals are emitted from multiple destinations while multiple sources are sending their information signals to the relay. Analysis reveals that the asymptotic harvested energy is independent of the fast fading effect of wireless channels; meanwhile, the transmission powers of each source and destination can be scaled down inversely proportional to the number of relay antennas. To reduce energy leakage interference and multipair interference, zero-forcing processing and maximum ratio transmission are employed at the relay. The fundamental trade-off between harvested energy and achievable sum rate is quantified, and it is shown that the asymptotic sum rate is neither convex nor concave with respect to the power splitting and the destination transmission power. Thus, a one-dimensional embedded bisection algorithm can be used to jointly determine the optimal power splitting and destination transmission power, which shows that destination-aided SWIPT is beneficial for harvesting energy and increasing the sum rate.
For the past decade, wireless sensor networks have focused primarily on data collection. As a result, the network topology for these systems was usually heavily centralized. However, for these networks to form a full IoT system, the introduction of proper actuation units and decision-making intelligence is inevitable. To that end, a new distributed wireless sensor and actuator network (DWSAN) system has been proposed. The DWSAN system architecture effectively combines both sensor and actuation hardware devices to a single network and manages this network so that the actuation decisions are made in a distributed manner and the topology of the network maintains a multitier architecture. Intensive sets of evaluations reveal that, compared to the centralized approaches that have been used in most wireless sensor network systems until now, when actuation units are introduced to the system, the DWSAN architecture reduces the transmission load of the network and the actuation decision-making latency by close to twofold and threefold, respectively. Furthermore, this benefit naturally leads to better scalability of the system, making it suitable for various sensing applications in different environments.
Device heterogeneity and device-to-device (D2D) communication are further important enablers for the Internet of Things. For D2D communication, transmit and receive beamforming with channel state information can be used for the virtual MIMO enabled by D2D receive cooperation. An analysis of the sum rate achieved by a device pair has been made, and a distributed algorithm for device pairing to maximize the throughput of the multidevice network has been developed in this special issue. Furthermore, for heterogeneous multicore IoT systems, an efficient thread mapping and scheduling mechanism primed for heterogeneously configured multicore systems has been proposed, which considers CPU utilization when mapping running threads to the appropriate core that can potentially deliver the actually needed capacity.
Finally, topology management for IoT has been studied analytically. For successful realization of IoT, challenges such as heterogeneous connectivity, ubiquitous coverage, reduced network and device complexity, enhanced power saving, and enhanced resource management have to be solved, and all these challenges are heavily impacted by the IoT network topology.

Introduction
The Internet of Things (IoT), connecting billions of devices, is anticipated to make our society smarter by collecting remotely sensed data from the devices and actuating the devices autonomously based on the collected data [1,2]. The networks for the IoT would connect the devices not only to an access point (AP) but also to neighboring devices through direct communication links, which are often observed in machine-to-machine (M2M) and device-to-device (D2D) communications [3][4][5]. In addition, the direct links are controlled by the network as a whole or in part to guarantee the quality of service. This paper utilizes the direct links of M2M and D2D communications to assist the information transfer from an AP to the devices through cooperation. There exist various cooperation methods such as amplify-and-forward (AF), decode-and-forward (DF), compress-and-forward (CF), and their variations to enlarge the coverage area or to improve the reliability of information delivery [6][7][8][9]. Among these methods, CF-based receive cooperation has attracted attention recently as a method of improving the system capacity by forming a virtual multiple-input multiple-output (MIMO) configuration with single-antenna devices [10][11][12][13].
The CF-based receive cooperation is considered for an AP with two antennas supporting two devices in [10,11] from the aspect of physical layer design with adaptive modulation and coding. A higher-layer issue of device matching and resource sharing has been studied for networks with multiple devices in [12,13], where a distributive message passing algorithm is derived based on bargaining in exchange networks [14][15][16][17][18]. These studies have shown that CF-based receive cooperation not only enjoys a proximity gain from closely located devices but also provides flexibility in the design of the D2D links by allowing any type of physical layer standard for the D2D links.
This paper proposes a network-assisted D2D communication network as one wireless communication platform for the IoT, which forms the virtual MIMO with multiple-antenna devices to enhance the rate from the physical layer aspect and implements distributed device pairing algorithms to improve the overall network throughput. The contributions of this paper are summarized as follows. Firstly, the beamforming (BF) and power allocation (PA) strategies exploiting the transmit CSI are proposed for the virtual MIMO to improve the rate performance of the downlink physical layer. Secondly, several distributive algorithms for device pairing are derived from bargaining [15] and auction [19] to provide network throughput equivalent to that of the optimal pairing. Thirdly, we verify the synergy of combining the virtual MIMO and device pairing algorithms under various system configurations in the scenario of IoT with D2D communications. It should be further noted that the D2D-assisted virtual MIMO with its enhancements can be adopted in mobile cellular networks providing proximity services [4], and the distributive pairing algorithms can be combined with various cooperation strategies among the devices for information sharing and processing [9].
The rest of this paper is organized as follows. After providing notations used in this paper, we describe the system model with and without D2D cooperation in Section 2. Section 3 investigates several methods to improve the sum rate of the two devices in D2D cooperation. Device pairing algorithms are devised from bargaining and auction to maximize the network throughput in Section 4. The performance of the network is evaluated in different aspects in Section 5 followed by the concluding remarks in Section 6.

Notations. We denote by A^T, A^H, tr[A], and [A]_{m,n} the transpose, Hermitian transpose, trace, and (m, n)th element of a matrix A, respectively; diag(x) denotes the diagonal matrix with diagonal vector x, I_N denotes the N × N identity matrix, and 0_{M×N} denotes the M × N matrix with all-zero elements. We denote by C^{M×N} the space of M × N complex matrices. The operator E[⋅] denotes the expectation, ∼ signifies "distributed as," and CN(x̄, Σ) denotes the distribution of a complex Gaussian vector x with mean x̄ and covariance matrix Σ.

System Model
Consider an IoT network with an AP and K devices as described in Figure 1(a), where the AP is equipped with M transmit antennas and each device is equipped with N receive antennas. We consider asymmetric antenna configurations with M ≥ 2N, which are often encountered in practical scenarios with high-complexity APs and low-complexity IoT devices. The downlink (DL) from the AP to the devices utilizes a licensed spectrum of bandwidth B for long-range communications, while the D2D links utilize an unlicensed spectrum of bandwidth B_d for relatively short-range communications, as shown in Figure 1(b); although an unlicensed spectrum is also available at much higher frequencies such as 60 GHz, we limit our description to an unlicensed spectrum at a carrier frequency below 6 GHz, which has a propagation model similar to that of the licensed spectrum. The DL frame is divided into equilength time slots, each of which is dedicated to one device and shared with its paired device.
The channel from the AP to device k is modeled by the channel response matrix H_k = [h_{k,1}, h_{k,2}, . . . , h_{k,M}] ∈ C^{N×M}. The component vector is modeled as h_{k,m} ∼ CN(0_{N×1}, γ_k I_N), where γ_k = c φ_k d_k^{−ν} is the path loss at device k determined by the distance d_k from the AP, the path-loss constant c, and the path-loss exponent ν, and φ_k is the log-normal random variable reflecting the shadowing. The D2D link from device j to device k is described by the channel matrix G_{k,j} = [g_{k,j,1}, g_{k,j,2}, . . . , g_{k,j,N}], where g_{k,j,n} ∼ CN(0_{N×1}, γ_{k,j} I_N) and γ_{k,j} = c φ_{k,j} d_{k,j}^{−ν} is the path loss between devices k and j at distance d_{k,j}. Channel reciprocity holds in the D2D links, so that G_{k,j} = G_{j,k}^T. This paper assumes that the AP has perfect CSI on the channels {H_k}_{k=1}^K. The network pairs the devices for signal transmission with transmit beamforming (BF) over the time slots allocated to the devices. Without D2D receive cooperation, the network supports N × M MIMO from the AP to each device in the pair. If the devices in the pair participate in D2D receive cooperation, the network forms a 2N × M virtual MIMO channel from the AP to each device in the pair, which can support min(M, 2N) spatially multiplexed symbols.
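As a concrete illustration, the channel model above (i.i.d. Rayleigh fading scaled by distance-based path loss and log-normal shadowing) can be sampled as follows. This is a minimal sketch: the function name and the default constants (taken from the simulation settings of Section 5) are our own choices, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def channel(n_rx, n_tx, dist_m, c_db=-133.5, nu=3.76, shadow_db=10.0):
    """Sample H with i.i.d. CN(0, gamma) entries, where the average gain
    gamma = c * phi * d^(-nu) combines path loss and shadowing."""
    phi_db = rng.normal(0.0, shadow_db)                 # shadowing in dB
    gamma = 10.0 ** ((c_db + phi_db) / 10.0) * dist_m ** (-nu)
    h = (rng.standard_normal((n_rx, n_tx))
         + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2.0)
    return np.sqrt(gamma) * h
```

Both the AP-to-device channels H_k and the D2D channels G_{k,j} can be drawn this way, with the appropriate antenna counts and distances.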
Let devices k and j be paired for signal transmission over their time slots with transmit BF W_{k,j}. The transmit symbol vector for devices (k, j) is modeled as x = √(P α) W_{k,j} s, (1) where P is the transmit power of the AP, s is the 2N × 1 symbol vector with E[s s^H] = (1/2N) I_{2N}, and α = 2N / tr[W_{k,j} W_{k,j}^H] is the scaling factor satisfying the transmit power constraint P. The corresponding received signal vector at device i ∈ {k, j} is expressed as y_i = [y_{i,1}, y_{i,2}, . . . , y_{i,N}]^T = √(P α) H_i W_{k,j} s + n_i, (2) where n_i ∼ CN(0_{N×1}, σ² I_N) is the background-noise vector at device i.
With D2D cooperation that exchanges compressed versions of the received signals between the paired devices, the received signals available at device i ∈ {k, j} are given by ỹ_i = [y_i^T, ỹ_ī^T]^T = √(P α) H̃ W_{k,j} s + ñ_i, (3) where ī denotes the partner of device i, H̃ = [H_k^T, H_j^T]^T is the stacked pair channel, ñ_i collects the background noise and the compression noise, and R_{ī,n} is the source coding rate employed at device ī for compression of the samples {y_{ī,n}}_{n=1}^N. Therefore, we have the compression noise variance σ²_{q,ī,n} as a decreasing function of R_{ī,n}. (4)
Under this model, the network employs the transmit BF at the AP; that is, W_{k,j} = Ṽ_{k,j} and α_d2d = 1, where Ṽ_{k,j} ∈ C^{M×2N} is the matrix constructed with the first 2N columns of V_{k,j} from the singular value decomposition (SVD) of the pair channel matrix H̃ = U_{k,j} Σ_{k,j} V_{k,j}^H. Device i receiving the signals in (3) through D2D cooperation uses the receive BF U_{k,j}^H such that z_i = U_{k,j}^H ỹ_i = √(P α) Σ_{k,j} s + U_{k,j}^H ñ_i. (7) With D2D cooperation, the transmit symbol vector s can be devoted to symbol transmission for device k only, for device j only, or for both, according to the resource allocation. For comparison, we consider the system without D2D cooperation. Up to N symbols can be multiplexed for each device at the AP as s = [s_k^T, s_j^T]^T, where s_i ∈ C^{N×1} is the symbol vector for device i ∈ {k, j}. The transmit BF W_{k,j,no} without D2D cooperation can be designed by means of the block diagonalization (BD) method so that s_i can be decoded from (2) at device i ∈ {k, j} as in [21]. To be specific, W_{k,no} is designed to be orthogonal to H_j but steered to H_k as W_{k,no} = V_j^⊥ Ṽ_{k,bd}, where V_j^⊥ ∈ C^{M×(M−N)} is the submatrix of V_j corresponding to the null space of H_j = U_j Σ_j V_j^H and Ṽ_{k,bd} ∈ C^{(M−N)×N} is the submatrix corresponding to the nonzero eigenvalues from the right unitary matrix V_{k,bd} of H_k V_j^⊥ = U_{k,bd} Σ_{k,bd} V_{k,bd}^H. The design of W_{j,no} is obtained in a similar way. Therefore, we have W_{k,j,no} = [W_{k,no}, W_{j,no}] and α_no = 1. By applying U_{i,bd}^H to y_i at device i ∈ {k, j}, we have z_i = U_{i,bd}^H y_i = √(P α) Σ_{i,bd} s_i + U_{i,bd}^H n_i. (10)
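The SVD-based pair beamformer described above can be sketched numerically: the transmit BF keeps the first 2N right singular vectors of the stacked pair channel, and applying the left singular vectors at the receive side diagonalizes the virtual MIMO channel. Function and variable names below are our own; this is a sketch of the scheme as described, not the authors' code.

```python
import numpy as np

def virtual_mimo_bf(H_k, H_j):
    """SVD-based BF for the 2N x M stacked pair channel. Returns the
    transmit BF W (first 2N right singular vectors), the receive BF U,
    and the singular values, so that U^H @ H_pair @ W is diagonal."""
    H_pair = np.vstack([H_k, H_j])                 # 2N x M pair channel
    U, s, Vh = np.linalg.svd(H_pair)               # H_pair = U diag(s) Vh
    W = Vh.conj().T[:, : H_pair.shape[0]]          # first 2N columns of V
    return W, U, s
```

A quick check that U^H H̃ W equals diag(s) confirms the spatial streams are decoupled, as used in (7).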
It should be noted that 2N received signals are available at each device with D2D cooperation, while only N received signals are available at each device without it, and thus the BF strategies are different.

Improvement of Sum Rate of a Device Pair
With D2D cooperation, the sum rate achieved by the pair (k, j) is a linear combination of the rates R̃_k, delivered to device k when the two time slots are devoted to device k, and R̃_j, delivered to device j when the two time slots are devoted to device j. The sum rate achieved by a device pair is given by S_{k,j} = λ_k R̃_k + λ_j R̃_j bps, (11) where the nonnegative factors λ_k and λ_j denote the fractions of the two time slots allocated to devices k and j, respectively, subject to λ_k + λ_j = 1. Here, R̃_k is given by R̃_k = B Σ_{n=1}^{2N} log₂(1 + (P α / σ̃²_{k,n}) σ²_{k,j,n}) bps (12) from (7), where σ²_{k,j,n} = [Σ²_{k,j}]_{n,n} and σ̃²_{k,n} = σ² + σ²_{q,k,n} is the effective noise variance including the compression noise. The rate R̃_j delivered to device j is obtained in a similar form by exchanging k and j.
To improve the rate (12), we need to reduce σ̃²_{k,n}, or equivalently the compression noise variance σ²_{q,j,n} given in (4). We can reduce σ²_{q,j,n} by increasing the source coding rate R_{j,n} as much as possible. The source coding rate for device j to compress the received signal samples {y_{j,n}}_{n=1}^N to be forwarded to device k is limited by the channel capacity C_{k,j} = B_d log₂ det(I_N + (P_d/σ_d²) G_{k,j} G_{k,j}^H) of the D2D link, where P_d and σ_d² are the transmit power and noise power of the device, respectively. Specifically, for the forwarded samples to be recovered at device k without errors, their transmission rate over the D2D link should not exceed C_{k,j}. Therefore, the source coding rate is upper-bounded by R_j^max = β log₂ det(I_N + (P_d/σ_d²) G_{k,j} G_{k,j}^H), (15) where β = B_d/B is the bandwidth ratio.
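The capacity bound on the source coding rate can be evaluated directly. The sketch below (our own naming) computes β · log₂ det(I + (P_d/σ_d²) G G^H), i.e., the D2D link capacity normalized to the DL bandwidth:

```python
import numpy as np

def max_coding_rate(G, p_d, noise_var, beta=1.0):
    """Upper bound on the D2D source coding rate per DL channel use:
    beta * log2 det(I + (p_d / noise_var) * G @ G^H)."""
    n = G.shape[0]
    m = np.eye(n) + (p_d / noise_var) * (G @ G.conj().T)
    _, logdet = np.linalg.slogdet(m)   # log-determinant, numerically stable
    return beta * logdet / np.log(2.0)
```

For a single-antenna D2D link at unit SNR this reduces to log₂(2) = 1 bit per channel use, scaled by the bandwidth ratio β.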
With the maximum source coding rate, the variance of the compression noise (16) follows from (4) and (15). From (16), we can suggest some strategies to improve the sum rate achieved by a device pair in D2D cooperation. Firstly, the compression noise variance decreases exponentially as the bandwidth expansion ratio β increases. Secondly, for a fixed bandwidth expansion ratio of β = 1, the compression noise variance is reduced when the D2D link quality is much better than the DL quality in an average sense. We can further improve the sum rate of the device pair by applying power allocation (PA) ρ = [ρ_1, ρ_2, . . . , ρ_{2N}]^T such that E[s s^H] = diag(ρ) and Σ_{n=1}^{2N} ρ_n = 1. In this case, the sum rate of the device pair is given by R̃_k = B Σ_{n=1}^{2N} log₂(1 + (P α ρ_n / σ̃²_{k,n}) σ²_{k,j,n}) bps, (18) which is maximized by applying the water-filling algorithm with the information on {(P α / σ̃²_{k,n}) σ²_{k,j,n}}_{n=1}^{2N} [20]. Although the CSI on {H_k}_{k=1}^K is available at the AP, the effective noise variance σ̃²_{k,n} is not easy to report to the AP since it depends on the transmit signal and D2D link qualities. Therefore, the AP can compute the PA with the water-filling algorithm by assuming σ̃²_{k,n} = σ². This sum rate of the device pair becomes the maximum when σ̃²_{k,n} = σ², or equivalently when β = ∞. For comparison, the achievable sum rate of pair (k, j) using two slots without D2D cooperation is given by S_{k,j,no} = R_{k,no} + R_{j,no}, where R_{k,no} = B Σ_{n=1}^{N} log₂(1 + (P α_no / σ²) σ²_{k,n,bd}) bps (19) from (10), where σ²_{k,n,bd} = [Σ²_{k,bd}]_{n,n}.
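The water-filling PA referenced above admits a short direct implementation. The sketch below (our own naming) is the standard KKT water-filling over the effective per-stream gains and returns the power fractions ρ_n:

```python
import numpy as np

def water_filling(gains, total=1.0):
    """Water-filling over parallel streams with effective gains g_n
    (e.g., P*alpha*sigma_{k,j,n}^2 / noise in the paper's notation).
    Allocates p_n = max(mu - 1/g_n, 0) with sum(p) = total."""
    g = np.asarray(gains, dtype=float)
    order = np.argsort(-g)                    # strongest streams first
    p = np.zeros_like(g)
    for k in range(len(g), 0, -1):            # shrink active set until valid
        mu = (total + np.sum(1.0 / g[order[:k]])) / k   # water level
        alloc = mu - 1.0 / g[order[:k]]
        if alloc[-1] >= 0.0:                  # weakest active stream is ok
            p[order[:k]] = alloc
            break
    return p
```

With equal gains the allocation is uniform; with very disparate gains the weak streams are switched off, which is exactly why the PA gain of "D2D-PA" grows with the number of channel modes in Section 5.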

Distributed Pairing Algorithms
This section aims at forming the pairs to maximize the average normalized throughput of the network. The problem is formulated as max_Π Σ_k Σ_{j>k} π_{k,j} S_{k,j} subject to Σ_j π_{k,j} ≤ 1 for all k, (20) where Π is the K × K matrix with [Π]_{k,j} = π_{k,j}. The pairing indicator π_{k,j} ∈ {0, 1} is given by π_{k,j} = 1 when devices k and j are paired and π_{k,j} = 0 otherwise. We first solve the problem using a distributed algorithm based on Nash bargaining, as proposed in [12], not only to maximize the average throughput but also to satisfy each device's desire to obtain the best bargaining gain. We then derive auction algorithms for device pairing which are equivalent to the bargaining algorithm.
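For reference, the optimal pairing that (20) asks for can be found by exhaustive search over all perfect matchings, which is how the benchmark curves in Section 5 are obtained. A sketch (our own naming; feasible only for small K, since the number of matchings grows as (K−1)!!):

```python
def best_pairing(devices, weight):
    """Exhaustive search for the maximum-weight perfect matching over an
    even set of devices. `weight(k, j)` returns the pair sum rate S_kj.
    Returns (best total weight, list of pairs)."""
    def rec(ds):
        if not ds:
            return 0.0, []
        k, rest = ds[0], ds[1:]            # pair the first device with each candidate
        best_val, best_pairs = float("-inf"), []
        for i, j in enumerate(rest):
            others = rest[:i] + rest[i + 1:]
            val, pairs = rec(others)
            if weight(k, j) + val > best_val:
                best_val, best_pairs = weight(k, j) + val, [(k, j)] + pairs
        return best_val, best_pairs
    return rec(sorted(devices))
```

The distributed algorithms below aim to reach this optimum without any central enumeration.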

Bargaining Algorithm Based on Message Passing. The problem is described by a graph (V, E) in an exchange network [14], where V is the set of vertices corresponding to the devices and E is the set of edges corresponding to the D2D links among the devices. The edge (k, j) ∈ E connecting vertices k and j is associated with a positive weight w_{k,j} = S_{k,j}^min = min(R̃_k, R̃_j), (21) which corresponds to the sum rate of the device pair (k, j) ensured by any sharing policy (λ_k, λ_j); without knowing (λ_k, λ_j) in advance, we estimate the sum rate of the device pair pessimistically with the minimum value. Let (M(Π), r) denote a configuration of the device pairing and resource sharing, where M(Π) is a matching described by Π in graph (V, E) and r = {r_k : k ∈ V} is the set of the rates that the devices can obtain from the matching M(Π). The solution is obtained with a balanced configuration [14] satisfying the stability and balance conditions.

(1) Initialize t = 0 and set all messages to zero. (2) repeat (3) At device k, update the offer o_{k→j}(t) using (24) for each j ∈ N(k) and send it to device j. (4) On receipt of o_{j→k}(t) from all neighboring devices, update r_k(t) using (25). (5) At device k, update the alternative rate α_{k→j}(t) using (26) for each j ∈ N(k) and send it to device j. (6) t ← t + 1. (7) until all messages converge or the maximum number of iterations is reached. (8) At device k, fix o_{j→k} and α_{j→k} for all j ∈ N(k) to determine r_k using (25), and find the partner device j ∈ N(k) such that r_k = o_{j→k}.

Algorithm 1: Distributed matching and resource sharing algorithm.
This stable balanced configuration can be found by a distributed algorithm based on a message passing (MP) framework [17,18,22,23], where each device transfers a nonnegative message on its current best alternative in device matching to its neighbors at each time instant. Let α_{k→j}(t) denote the message from device k to device j at time t, representing the resource share which is guaranteed for device k in the pair (k, j). That is, α_{k→j}(t) is the alternative rate that device k can obtain maximally up to time instant t when pairing with a neighbor i ∈ N(k) other than j. Initially, α_{k→j}(0) is set to zero for all (k, j) ∈ E since there is no guarantee ensured by the neighboring devices initially. Based on the alternative rate, an individual device makes an offer o_{k→j}(t) to each of its neighbors for D2D cooperation, where the offer from device k to device j is computed as in (24). With the offers from the other devices, device k estimates its maximal rate r_k(t) using (25). The update rule for α_{k→j}(t) is given by (26), where r̃_k(t+1) is an estimate of the rate guaranteed for device k and δ ≈ 0.7 is the damping factor [12]. With the updated α_{k→j}(t+1), the repeated updates of o_{k→j}(t+1) and r_k(t+1) proceed using (24) and (25), respectively. The iteration continues until o_{k→j}(t+1) converges for all available pairs (k, j). Let o_{k→j}, r_k, and α_{k→j} denote the corresponding values to which o_{k→j}(t), r_k(t), and α_{k→j}(t) converge, that is, a fixed point of the algorithm. Upon convergence, device k sets its share to r_k and chooses device j ∈ N(k) with r_k = o_{j→k} as its partner. From r_k, the device obtains the resource sharing factor λ_k = r_k / R̃_k. This procedure is summarized in Algorithm 1.

Equivalent Auction Algorithm for Device Pairing.
The auction algorithm finds the maximum weighted matching (MWM) via an auction, based on the notion of buyers and prices [19]. The MWM corresponds to the optimal pairing providing the maximum throughput, so we briefly describe the auction algorithm in terms of the messages defined in the previous section.
We can set the alternative rate α_{k→j}(t) to the price p_j(t) which buyer k pays to buy product j and set the sum rate w_{k,j} to the value of product j, respectively. The benefit that can be earned when buyer k buys product j is given by w_{k,j} − p_j(t). The goal is to find the assignment of all buyers and products that maximizes the overall benefit, that is, Σ_{(k,j)} (w_{k,j} − p_j(t)).
Therefore, the sum Σ_{(k,j)} p_j(t) of the prices is constant and the MWM maximizes the overall benefit [15]. To handle this problem, the auction algorithm proceeds as in Algorithm 2. This algorithm is known to converge to the MWM if a solution exists [15,19].
In this algorithm, the maximum is taken over all incoming offers o_{j→k}(t) when updating the price p_k(t+1). In addition, a device does not bid at every iteration or to every neighboring device, thereby slowing down the convergence speed. Also, the auction algorithm works only for a bipartite network where one class of buyers is assigned only to the other class of products. Thus, the algorithm should be modified to find the MWM in a nonbipartite network where buyers can also become products. Note that, in a nonbipartite network, the net benefit of device j corresponds to the price that device k should pay to form a pair (k, j). Consequently, device k does not necessarily consider the offer from device j when determining the price that device k should pay for pairing. To achieve all of these objectives, the auction algorithm is modified to obtain a new algorithm running in a nonbipartite network, and the resulting algorithm is summarized in Algorithm 3.
In this algorithm, device k can bid to all neighboring devices at each iteration instead of updating the price by adding a small constant ε. Here, note that the auction algorithm in this subsection bears a very close resemblance to the bargaining algorithm in Section 4.1: if we drop the second term in (24), which corresponds to dividing the surplus rate gain equally, we end up with Algorithm 3. Thus, the bargaining algorithm only differs from the modified auction algorithm in that it includes an extra term that reduces the "offers" by half of the surplus rate gain.

(1) Initialize t = 0 and set all messages to zero. (2) repeat (3) At device k, update the offer o_{k→j}(t) using o_{k→j}(t) = w_{k,j} − α_{k→j}(t) + ε with a certain positive constant ε and send it to device j. (4) On receipt of o_{j→k}(t) at each device k, calculate the alternative rate gain α_{k→j}(t) using (26) and send it back to device j. (5) t ← t + 1. (6) until all messages converge or the maximum number of iterations is reached. (7) At device k, fix o_{j→k} and α_{j→k} for all j ∈ N(k) to determine r_k using (25), and find the partner device j ∈ N(k) such that r_k = o_{j→k}.

Algorithm 3: Modified auction algorithm for device pairing in a nonbipartite network.

In fact, this algorithm is proved to converge to the MWM if a unique solution exists, using the same technique as [12]. We now consider the generalization of the message update rule (24) in which the surplus-sharing term is weighted by a factor κ instead of 1/2. (27) We see that (27) leads to the bargaining algorithm in Section 4.1 with κ = 1/2 and to the modified auction algorithm with κ = 0, respectively. In fact, for κ ∈ [0, 1), the corresponding algorithm finds the MWM if a unique solution exists. This statement can be proved using the technique in [16]. Thus, the distributive algorithms provided in this work can also find the MWM.
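For intuition, the classic bipartite ε-auction that this subsection builds on can be sketched as follows. This is the textbook auction mechanism, not the paper's exact Algorithm 2 or 3: each unassigned buyer bids on its best product by the margin over its second-best net benefit plus ε, and prices rise until every buyer holds a product.

```python
import numpy as np

def auction_assignment(values, eps=0.01):
    """epsilon-auction for the bipartite assignment problem.
    values[b, p] is buyer b's value for product p. Returns (owner, prices),
    where owner[p] is the buyer holding product p; the resulting assignment
    is within n*eps of the optimal total value."""
    n = values.shape[0]
    prices = np.zeros(n)
    owner = [-1] * n
    unassigned = list(range(n))
    while unassigned:
        b = unassigned.pop()
        net = values[b] - prices                  # net benefit per product
        best = int(np.argmax(net))
        second = np.partition(net, -2)[-2] if n > 1 else net[best]
        prices[best] += net[best] - second + eps  # raise price by the margin
        if owner[best] != -1:
            unassigned.append(owner[best])        # previous owner is outbid
        owner[best] = b
    return owner, prices
```

In the device-pairing setting the "values" correspond to the pair sum rates w_{k,j} and the "prices" to the alternative rates, as described in the text.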

Numerical Results
We evaluate the performance of the network with K = 8 devices when P = 29 dBm, P_d = 23 dBm, B = 10 MHz, and σ² = σ_d² = −104.5 dBm. The devices are clustered in a circle of radius r whose center is located at distance D from the AP. In the circle, the locations of the devices are randomly generated with a uniform distribution. The SNR at the center of the cluster circle is denoted by SNR_D and used as a reference parameter. We assume an unlicensed spectrum of bandwidth B_d = 10 MHz, 20 MHz, and 40 MHz for β = 1, β = 2, and β = 4, respectively. The path-loss exponent is set to ν = 3.76 and the shadowing deviation is set to 10 dB for both the licensed and unlicensed spectrums. Figure 2 shows the performance of the cluster with D = 300 m and r = 200 m as the cluster center SNR varies when M = 2, N = 1, β = 2, and the path-loss constant is −133.5 dB. In the figure, "D2D-PA" and "D2D" indicate the proposed schemes with water-filling PA and equal PA, respectively, while "No D2D" denotes the BD-based BF without D2D cooperation and "D2D (No CSI)" indicates D2D cooperation without the transmit CSI; here, "D2D (No CSI)" applies zero-forcing receive BF at the devices. For the benchmark, we also provide the performance with an ideal D2D link ("Ideal") having no compression noise, or equivalently β = ∞. The lines in the figure correspond to the results with optimal device matching obtained by finding the maximum value after searching over all possible pairing configurations, while the marks correspond to the results with the distributive algorithm presented in this paper. The figure shows that the distributive algorithm provides performance almost indistinguishable from the optimal performance. The transmit CSI on the DL improves the average throughput both with and without D2D cooperation. The best performance is achieved with "D2D-PA," exploiting both the transmit CSI and D2D cooperation. The two subfigures of Figure 3 describe a matching of "D2D" and "No D2D," respectively, for the same channel realization.
The numbers inside the squares identify the devices, and the numbers near the edges of the same color are the sum rates (weights) of the device pairs associated with the edges. We also show the rates (shares) of the devices next to the devices in Figure 3(a). Note that "D2D" can support the rates of the devices by choosing the resource sharing factors (λ_k, λ_j) while "No D2D" cannot support them. With D2D cooperation, devices in proximity are likely to be paired to improve the quality of the D2D links, as shown in Figure 3(a). On the other hand, devices without D2D cooperation tend to be paired independently of their locations but dependent on the DL channel qualities, as shown in Figure 3(b). As expected from the distributive algorithm, a device in a better condition tends to obtain a higher rate than the other device in the pair with D2D cooperation; for instance, device 1, having more neighbors to bargain with, tends to obtain a higher rate than device 2, which has fewer neighbors. Figure 4 compares the performance of the network with transmit CSI for different antenna configurations M × N when β = 2 and the other conditions remain unchanged from Figure 2. As the number of antennas increases, the throughput improves for all methods by multiplexing more symbols at the transmitter. In addition, the gain of "D2D-PA" over "No D2D" increases with the number of antennas by concentrating the signal power on the desired channel modes. Unfortunately, the performance gain with D2D cooperation is marginal for the 2 × 1 antenna configuration. To improve the performance of D2D cooperation for M = 2 and N = 1, we increase the bandwidth expansion ratio β in Figure 5 while the other conditions remain unchanged from Figure 2. The D2D cooperation methods benefit from improving the D2D link qualities through increasing the D2D bandwidth, irrespective of whether the transmit CSI is available at the AP or not. The gain is most prominent when β increases from 1 to 2.
Figure 6 compares the performances as the cluster location varies when β = 1 and r = 25 m. The antenna configuration is set to M = 2 and N = 1 in Figure 6(a) and to M = 4 and N = 2 in Figure 6(b). The cluster center SNR is 0 dB at D = D_max = 300 m for both figures, and the cluster center SNR at distance D is obtained from SNR_D = SNR_{D_max} (D_max/D)^ν.
The results reveal that the gain of D2D cooperation is larger when the cluster location gets farther from the AP since the quality of the D2D links relative to the DL gets better. This observation comes from the fact that the signals with larger amplitude suffer from larger quantization errors for the fixed number of quantization levels. The figure also shows that the effect of the compression noises becomes negligible when the device cluster is as small as that used in the figure.
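The distance scaling of the reference SNR used in Figure 6 can be computed directly; in dB the relation SNR_D = SNR_{D_max} (D_max/D)^ν becomes an additive path-loss term. The symbol names below follow our reconstruction of the simulation parameters:

```python
import math

def cluster_center_snr_db(d, d_max=300.0, snr_at_dmax_db=0.0, nu=3.76):
    """Cluster-center SNR at distance d, scaled from the reference value
    at d_max by the path-loss law (d_max / d)^nu, expressed in dB."""
    return snr_at_dmax_db + 10.0 * nu * math.log10(d_max / d)
```

For example, moving the cluster from 300 m to 30 m raises the reference SNR by 10ν = 37.6 dB, which is why the D2D gain is more visible for far clusters where the DL is weak.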

Concluding Remarks
This paper considers an IoT network supporting both centralized signal transmission and D2D communications. With asymmetric antenna configurations, we propose the transmit and receive BF for the virtual MIMO formed by D2D cooperation. For the proposed method, we provide strategies to improve the sum rate of the device pair and a distributive algorithm for device pairing and resource sharing to maximize the network throughput while meeting each device's desire. It is confirmed that the distributed algorithm also works well in the network with transmit CSI, providing performance close to that obtained with optimal pairing. The transmit CSI improves the performance of the network significantly even without D2D cooperation, but D2D cooperation boosts the performance further. The performance gain of D2D cooperation becomes more prominent when the D2D links are relatively better than the downlink, which is achieved by using a larger bandwidth for the D2D links and by applying the method to clusters of closely located devices farther from the AP.

Introduction
As the research in wireless sensing systems starts to mature, supported by various successful system deployments in a diverse set of application domains, researchers (and now the everyday users of these systems) have started to aim at higher research goals to make the systems even more intelligent and suitable for their target applications. A natural next step for systems that reliably collect sensing information is to autonomously make decisions on how the environment should change, with respect to the collected data [1]. For applications that do not require global knowledge, this is a plausible design choice. Furthermore, these systems should now add actuation units that carry out the decisions that are made [2]. Although this may seem like a trivial change to the system, the addition of system intelligence and proper actuation units changes the way we think of wireless sensing system development (and deployment) in a new direction [3]. Despite this, most recent work from the wireless sensor network research community remains focused on diversifying its application scope with data-collection-based systems (e.g., [4,5]) rather than detailing the requirements for wireless sensing and actuation system deployment, which, again, is an unavoidable next step. Specifically, from a technical perspective, the addition of actuation units in a wireless sensing system introduces the need to notify the target actuators of the proper actions to take. For example, an air conditioning unit may need to change its target temperature level or modify the fan speed, or a lighting unit may be asked to change the illumination level to account for changes in the target environment.
These types of actions not only require the packets in a wireless sensing system to travel bidirectionally (e.g., from a sensor to the gateway and back to an actuator), departing from the unidirectional data collection-centered approach of wireless sensor networks, but also require further optimizations to maximize the efficiency of message transmissions. Issues arise not only at the protocol level but also from a topology management perspective. The architecture of how nodes with different capabilities are logically connected can determine the scalability, or even the usability, of the system in different applications. Nevertheless, the answers to such issues are still unknown or have only been examined using simplified preliminary studies [6,7].
To provide answers to the questions raised by these new types of networked systems, in this work we introduce a system architecture where sensor nodes, actuation units, servers, and user devices interconnect (and interact) in a distributed manner to form a multitiered network. Specifically, the hardware components in our system, the sensing unit, the actuation unit, the master (gateway) node, and the server, form a hierarchical network architecture designed so that the amount of network traffic and the packet delivery latency on the resource-limited platforms are minimized. At the same time, this architecture allows for maximum scalability from an entire network's perspective. We generalize such a network architecture as a distributed wireless sensor and actuator network (DWSAN) and present a series of application domains where the DWSAN design can significantly benefit. Furthermore, using an intensive set of evaluations, we show that the distributed and hierarchical network architecture of DWSANs can significantly improve the efficiency of the wireless system.
Specifically, this paper makes the following contributions.
(i) We present a set of hardware implementations for designing a DWSAN, including sensor nodes and gateway nodes.
(ii) We present software techniques to effectively support DWSANs on a contextual basis. Specifically, we propose a new objective function for IETF RPL routing that takes sensing similarity as a routing metric.
(iii) The proposed DWSAN protocols and hardware are evaluated through an extensive set of simulations and real implementations.
The remainder of this paper is structured as follows. In Section 2 we present the architecture of the DWSAN system and introduce the benefits of using such a multitiered, distributed networking architecture for various wireless sensing and actuation applications. We introduce the hardware and software components designed and implemented for the DWSAN system in Section 3 and evaluate the performance of our proposal in Section 4. Finally, we position our work along with some related work using Section 5 and conclude the paper in Section 6.

Multitiered DWSAN System Architecture
2.1. DWSAN Architecture. While many sensor network architectures exist, the main goals in designing the distributed wireless sensor and actuation network (DWSAN) system architecture are threefold. First, ensure that the network architecture is scalable to large numbers of sensors and actuation devices. Second, ensure that the response latency of actuation units is minimized as changes in environmental sensing values are detected. Finally, despite the increasing number of data generation units (e.g., sensor and actuation nodes), the DWSAN system should minimize the amount of wireless traffic.
To achieve these goals, we design the DWSAN system as illustrated in Figure 1. Note that, compared to a traditional sensor network, instead of having sensors report all their data directly to the server, the sensing units report their measurements to interested actuators. For example, in a smart building with automated HVAC and lighting, there is no need for the HVAC actuation unit to know the current illumination level in a target room, nor does the lighting need knowledge of the current room temperature. Furthermore, since real-time temperature data and illumination level data are (in most cases) not necessary at the server, these sensing values are sent to the HVAC actuators and lighting units, respectively, so that they can make their own actuation decisions locally. At the end of the day, as the actuators collect data from the multiple sensors associated with them, the actuator nodes send data "summaries" to the server so that the system manager can keep track of the environment.
In many previous systems [8], despite heterogeneous nodes having different roles, all actuation decisions were made at the server and sent back into the network towards a target actuator. Such a process quickly saturates the wireless channel because the server acts as the sink of all data in the network. By taking a distributed approach, DWSAN distributes the decision-making processes locally and, in turn, spatially distributes the packet loads as well.

Applications.
The following paragraphs introduce applications that can benefit from the use of the DWSAN system architecture.

Room-Level Lighting Applications.
Sensors that measure the current illumination level of rooms can be used to infer the comfort level with respect to the desired illumination level of the users. Based on this information, actuators such as the lighting system itself or the blinds on the windows can be interconnected to provide the desired level of light in the room environment. In addition to illumination sensors, other sensors such as passive infrared (PIR) and ultrasound sensors can be used to infer other activities in the room and configure the lighting level with respect to the activity occurring in the target environment.
HVAC Applications. By tightly coupling HVAC-related actuation units with various wireless sensors, the HVAC system can react in a more fine-grained manner with respect to the real-time conditions of the environment. While variants of such systems are already implemented in a centralized manner, modifying the architecture in a distributed way can maximize the efficiency of the system.
Other Environment Customization Applications. Besides lighting and HVAC, by using sensors that extract the activity context of a given environment, wireless sensing systems (with intelligence) can be used to estimate the users' intents. Based on the conclusions of such decision-making units, various actuators can then control the environmental parameters in many respects to meet the satisfaction level that the users desire.

Architectural Benefits.
The applications above can be implemented using current-day technologies. Nevertheless, DWSAN can provide a set of technical benefits. The following paragraphs examine some of these benefits, several of which we evaluate in Section 4.
Improving the Actuation Response Latency. For many applications, the response time from detecting changes in the environment to performing the required actions to reconfigure the environment can be a critical factor. For example, changes in the illumination level would require quick adaptation of lighting units to ensure that the room always stays at the desired illumination level. This is also true for many surveillance applications.
DWSANs configure the network so that nodes form clusters interconnecting related sensor nodes with contextually correlated actuators. Furthermore, sensors report their measurements directly to actuation units rather than to a server. This naturally reduces the number of transmission hops and thus the actuation latency. In other words, since actuators make decisions on their actuation in a distributed manner, the data does not need to be forwarded to the central server, reducing the time from sensing an environmental change to actuating upon it and allowing systems to react more rapidly.

Reducing the Wireless Traffic on the Shared/Limited Wireless Medium.
Most of the widely used wireless bands are shared by different applications and even different wireless standards. To allow other systems to operate effectively, it is important to reduce the traffic overhead of the DWSAN system. Because of the reduced number of hops (or transmissions) needed to deliver sensing data to its final destination, the DWSAN system naturally reduces the amount of traffic placed on the wireless medium. This leaves room for other systems and protocols to interoperate well in the same geographical environment.
Maximizing the System's Scalability. Most environmental sensing applications can benefit from the use of more sensors deployed densely in a target environment. However, this is not easy to achieve due to the limited wireless bandwidth. By decreasing the number of transmitted packets, the system can use the remaining wireless capacity to support more nodes. Since centralized data gathering causes a communication bottleneck, DWSAN distributes the role of decision-making to different devices to further increase the system's scalability.

System Design
We now present the hardware and software components that constitute the DWSAN system architecture.

Hardware Components.
In terms of hardware, DWSAN consists of four major components: (1) the sensor module, (2) the actuation module, (3) the master gateway module, and (4) the back-end server. The following paragraphs present the hardware components in detail and discuss the design choices for each. We note that all our wireless components use the Atmel RF212 radio on the 900 MHz band [9]. We selected the sub-GHz range for our wireless communications since it provides a more stable communication range compared to the 2.4 GHz ISM band, which is important given the dynamics of various indoor environments.
Sensing Platform. For hardware designs, our primary focus was to design a generic sensing platform that could be applied to various indoor applications. Therefore, we designed a custom sensing platform as presented in Figure 2. Specifically, as sensing components, this platform includes a PIR sensor, CO and CO2 sensors, an illumination sensor, and a set of temperature and humidity sensors. Based on our literature survey of indoor sensing applications, we decided that these sensors were the ideal set. Nevertheless, by exposing a set of additional GPIOs on the board, we allow the platform to be expanded towards other sensing modalities as well. The current version of this sensing platform uses the Atmel 1284p microcontroller and provides buttons, LEDs, and USB connections.
Actuation Platform. Using the information collected from the sensing units, the role of the actuation platform is to control electronic (actuation) devices effectively. For example, if sensors in the environment report that the temperature of the room is higher than the target temperature, related actuation units (e.g., air conditioner or fan) try to configure themselves so that the target temperature is maintained. In reality, a diverse set of actuator types will exist, one (or more) for each application. One example of such an application-driven actuation unit is the light control unit shown in Figure 3(a). This unit is connected to the wall-mounted light controller to wirelessly manage lighting at a room scale.
Apart from such application-targeted devices, we also designed a more generic platform that controls the power state of the connected device (cf. Figures 3(b) and 3(c)). Using this electricity controlling unit the system can manage the power states of various appliances.
Master (Gateway) Node and Back-End Server. The gateway or master node of DWSAN, as presented in Figure 4, acts as the root of the IETF RPL network that we use to manage the nodes as a single network. This node connects the network of sensors and actuators to the server node. However, due to the distributed nature of DWSAN, the master node's main role is network management rather than data forwarding. The master node includes its own Atmel RF212 radio to connect with the wireless network, and this module is attached to an Internet-connected OpenWRT box to establish and maintain UDP/TCP connections with the back-end server, a simple PC consistent with our lean-server design.
Back-End Server. The selection of hardware components for the server was affected significantly by the DWSAN system architecture. Specifically, the distributed processing nature of DWSAN allows the system to utilize a lean server, letting us install a simple PC-scale server machine. While for many applications the processing overhead may not be a fundamental issue, we argue that, for scaled-up systems, our DWSAN architecture reduces the potential for server-side processing overload. A detailed illustration of the server architecture is presented in Figure 5.

Software Components.
From a software perspective, the implementation of DWSAN includes a set of features, some of which are essential to maintain the distributed, multitiered architecture and some of which are implemented to further maximize the benefits that DWSAN introduces.
The following paragraphs discuss these functionalities in detail.
Profile Generation and Distribution. In designing a distributed decision-making system, one of the most important factors to consider is the effective assignment of roles to all devices in the system. In fact, this role distribution and assignment process is the core of instilling the necessary intelligence into a distributed system. In DWSAN, we allow system operators to input target sensing granularities and actuations using simple grammars and deliver this information to all nodes in the network using sensing/actuation profile messages. We follow the specifications of the CoAP application layer protocol [10] in designing the format of the profiles. Once these messages are formatted at the server, they are distributed via unicast to each node in the network. For each unicast, we request a software-level, end-to-end acknowledgment message to ensure that the packet was successfully delivered. While we agree that unicasting profiles to each node may be a burdensome process, we make this design choice because the delivery of the profile message is, by far, one of the most important processes: it determines how the sensor nodes should sample their sensors and how actuators should behave with respect to incoming sensing data. Despite the distribution itself being a time-consuming process, it happens only once, when the system is initially installed. For additional nodes that are installed or updated during runtime, the server selectively sends profile messages to the related nodes only; thus, the time and wireless channel capacity overhead are reduced to a minimum.
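To make the profile idea concrete, the sketch below builds and parses a profile payload. The JSON-like layout, field names, and the `make_profile`/`parse_profile` helpers are illustrative assumptions for this sketch only; the actual CoAP payload format used by DWSAN is not specified here.

```python
import json

def make_profile(node_id, role, sensing=None, actuation=None):
    """Build a sensing/actuation profile message.

    Hypothetical format: the server would unicast one such payload to each
    node, telling sensors how to sample and actuators how to react.
    """
    profile = {"node": node_id, "role": role}
    if sensing is not None:
        # e.g. {"temperature": {"period_s": 1, "delta_th": 0.4}}
        profile["sensing"] = sensing
    if actuation is not None:
        profile["actuation"] = actuation
    return json.dumps(profile)

def parse_profile(payload):
    """Decode a received profile payload back into a configuration dict."""
    return json.loads(payload)

# A node receiving this profile would sample temperature once per second
# and suppress reports whose change is below 0.4 (see adaptive sampling).
msg = make_profile(17, "sensor",
                   sensing={"temperature": {"period_s": 1, "delta_th": 0.4}})
cfg = parse_profile(msg)
```

The end-to-end acknowledgment and retransmission logic described in the text would sit around the unicast send of `msg`, not inside the payload itself.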

RPL Compatible Network with Clusters.
In the proposed DWSAN system we utilize a standardized routing protocol for low-power and lossy networks (LLNs). Specifically, we find the IETF IPv6 routing protocol for LLNs, RPL [11], suitable for our network configurations for three main reasons. First, using the RPL routing protocol forces the implementation to be standard-compliant at the networking and link adaptation layers. Second, by utilizing IPv6 addresses, DWSAN-based systems can easily connect with the larger Internet infrastructure and allow controlled access from external IPv6 machines. Lastly, RPL is designed to support bidirectional traffic patterns; thus, it allows point-to-point traffic to take place effectively and eases the development of a closed feedback-loop system. While utilizing RPL as-is can support most of the functionality DWSANs require, the performance (or network efficiency) may suffer. Sensor nodes that are associated with a given actuator node may not have a direct routing path between them. In other words, despite physical single-hop coverage, a packet may need to travel more hops to reach its (logically) associated actuation node because of the RPL objective function (OF); OFs in RPL determine how the next hop node in the multihop routing path is selected, and more details on this topic can be found in previous work by Ko et al. [12]. To resolve this inefficiency, we introduce a new RPL OF. Specifically, the Context Clustering OF defines a set of contextual targets and clusters nodes with similar (or related) contextual measurements. We do this by allowing nodes to overhear messages transmitted by potential parent nodes and compare the trend of data changes using a simple cross-correlation algorithm. With correlations above 0.7, we declare high similarity between the two data/sensing streams.
By specifying the supported sensor types on a sensor and the desired sensor measurement types at the actuator, next hop nodes in RPL are selected so that sensor nodes and actuator nodes with contextual similarity are clustered together. If possible (in the case where the node is within communication range), this actuator is selected as the next hop node for the sensors. Here, we note that sensors may select multiple actuators, as the same sensor data may be required at different actuators. In this case, sensing values can be sent either via unicast to each associated actuator sequentially or via multicast to all associated actuators  simultaneously. Overall, this new objective function observes the contextual similarity of data being collected at the sensors to determine the next hop node so that measurements can be suppressed as the packet traverses the network. By utilizing the Context Clustering objective function, we point out that the packet traverse path length from geographically relevant sensor nodes to target actuator nodes can be minimized. This naturally reduces the number of packet transmissions in a network and thus increases the scalability of the system.
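The similarity test underlying the Context Clustering OF can be sketched with a plain Pearson correlation over two overheard data streams. The exact correlation variant, window length, and sample streams below are illustrative assumptions; only the >0.7 threshold comes from the text.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length data streams."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def contextually_similar(xs, ys, threshold=0.7):
    """Cluster two nodes if their overheard streams share a common trend."""
    return correlation(xs, ys) > threshold

# Hypothetical overheard temperature streams (toy values):
a = [20.0, 20.5, 21.0, 21.4, 22.0]   # a sensor's own measurements
b = [19.8, 20.2, 20.9, 21.5, 21.9]   # overheard stream with the same trend
c = [20.0, 19.5, 20.1, 19.4, 20.2]   # unrelated, fluctuating stream
```

Here `a` and `b` would be declared similar (and hence clustered), while `a` and `c` would not.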

Actuator
Collaborations. There are situations in which multiple actuators that form a single actuation system (or unit) need to collaborate together to perform the proper actuation required at the environment level. For example, multiple lighting units may need to collaborate to provide the proper illumination required in a room. For such cases, DWSAN allows the actuation units to form a subnetwork and select an internal master node. To ease the decision-making process in such scenarios, all related sensing data is forwarded to the head actuator in the subnetwork (which is predefined with respect to the application's target) and this head node sends actuation commands to control other actuators on how they should act. This subnetwork of actuators is configured in two different ways. First, the application administrator can specify groups using the profile messages discussed earlier. Or second, the nodes can compute the contextual similarity of the data they receive to autonomously compute related actuation nodes. In the current implementation of DWSAN, we select the first approach of explicitly selecting the master node using profile messages but plan to address the issue of autonomous group management as part of our future work. Specifically, potentially we can allow nodes to compute similarities and identify the most common parent node that shares the highest contextual similarity for this automation process.
Adaptive Sampling for Indoor Environments. From a data analysis perspective, gathering fine-grained data from a sensor network is always the best scenario. However, in sensor networks, nodes hold only a limited amount of resources in terms of energy and bandwidth. Under such circumstances, sending all the data to the gateway is not a practical approach. With this in mind, many efforts have been made to reduce the number of transmitted sensing samples while still representing the original data properly.
A popular way to approach this issue is compressive sensing [13,14], in which nodes gather information from other nodes in the same region to configure their sampling and transmission rates (e.g., using spatial correlation). While this is attractive, we point out that, in indoor environments, distributed sensors can show significantly different measurements depending on their locations relative to the actuators (e.g., air conditioner, lighting sources), making spatial correlation difficult. Furthermore, since practical systems will not deploy tens of sensors in a single room, the correlation process becomes even more challenging. To exemplify, we plot in Figure 6 temperature readings from Sensirion SHT11 sensors at two locations in an office, one positioned close to the air conditioning unit and the other 10 meters away. Notice that while the two sensors show somewhat similar behaviors, differences in location and surrounding environmental factors cause the plots to differ in amplitude variation and short-term trends for some regions. This gives us a first-hand impression that using spatial correlation can be challenging in indoor environments and suggests the need for a data summarizing scheme that operates only on a sensor's local data.

Besides spatial correlation, there have also been attempts to summarize the original sensor stream using the data's temporal characteristics. These attempts, studied by many researchers in diverse domains, include adaptive/probabilistic models for data prediction, which require a sufficient amount of computing resources at the nodes [15], and event/threshold-based data summarizing schemes [16]. In DWSAN we take an approach similar to the latter. In our scheme, once a node collects a measurement from its sensor, it first passes this measurement through a low-pass filter.
While this low-pass filter may smooth out some of the dynamics, we apply it so that small variations in the data do not trigger reports. Next, we compare the short-term mean of the recent measurements, sample_mean, with the most recent measurement report sent to the actuator, sample_act. The difference, Δ = |sample_mean − sample_act|, is compared with a predefined threshold, Δ_TH. Δ_TH can be configured by the profile messages with respect to the data analysis algorithm's tolerable error bounds for the summary reports. If Δ > Δ_TH, we update sample_act to sample_mean and report it to the actuator. If Δ ≤ Δ_TH, since the current short-term mean of the samples is within the predefined error bounds, no reporting takes place. This simple scheme complements the DWSAN architecture in minimizing the number of packets to transmit.
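The Δ-threshold suppression logic described above can be sketched as follows. The exponential low-pass coefficient and the short-term-mean window size are hypothetical choices, as the text does not specify them; only the threshold comparison against Δ_TH follows the description directly.

```python
class AdaptiveSampler:
    """Threshold-based report suppression: low-pass filter, short-term
    mean, and comparison against a configurable delta threshold."""

    def __init__(self, delta_th, alpha=0.5, window=4):
        self.delta_th = delta_th      # tolerable error bound (from profile)
        self.alpha = alpha            # low-pass coefficient (assumed value)
        self.window = window          # short-term mean window (assumed value)
        self.filtered = []
        self.last_report = None       # sample_act in the text

    def feed(self, measurement):
        """Return the value to report to the actuator, or None to suppress."""
        prev = self.filtered[-1] if self.filtered else measurement
        lp = self.alpha * measurement + (1 - self.alpha) * prev
        self.filtered.append(lp)
        recent = self.filtered[-self.window:]
        sample_mean = sum(recent) / len(recent)
        if (self.last_report is None
                or abs(sample_mean - self.last_report) > self.delta_th):
            self.last_report = sample_mean   # update sample_act
            return sample_mean               # report to actuator
        return None                          # within error bounds: suppress
```

Feeding a steady stream produces a single initial report and then silence, while a sustained jump in the measurements triggers a new report once the short-term mean moves past Δ_TH.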
In addition, when no data points exceed the threshold, DWSAN sensor nodes also trigger reports using a multiplicatively increasing/decreasing trickle timer [17]. The timer's interval is increased multiplicatively on a periodic basis until reaching an upper limit, where it stays until an event occurs to reset the timer to a smaller value. At the beginning of each interval, nodes select a random time within [current time : current time + interval] to issue a measurement report. Even if reports are sent before the interval is over, the timer waits for the end of the interval to update the interval and then selects the next report time. When the Δ-based event detection scheme indicates that a report is sent (e.g., detected changes in the data), the timer's interval is decreased multiplicatively to send more reports and accurately track dynamic periods while suppressing reports when there are only minimal changes in the data. Note that, unlike the original trickle timer design, instead of resetting the timer value to the minimum upon events, we take a more conservative approach of multiplicatively decreasing the timer value.
Figure 7: Sensing-to-actuation decision and response latency for DWSAN and a centrally managed network.
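The timer behavior can be sketched as below, using the 500 ms minimum and 60-minute maximum that appear in the evaluation; the multiplicative factor of 2 is an assumption, as the text does not give the exact value.

```python
import random

class ReportTimer:
    """Trickle-inspired timer: the interval grows multiplicatively up to a
    cap while the data is static, and shrinks multiplicatively (rather than
    resetting to the minimum) when a data-change event occurs."""

    def __init__(self, i_min=0.5, i_max=3600.0, factor=2.0):
        self.i_min, self.i_max, self.factor = i_min, i_max, factor
        self.interval = i_min  # seconds

    def next_report_offset(self):
        # Pick a random report time within the current interval.
        return random.uniform(0.0, self.interval)

    def interval_expired(self):
        # No event during the interval: back off, capped at i_max.
        self.interval = min(self.interval * self.factor, self.i_max)

    def event(self):
        # Delta-based detector fired: shrink (not reset), floored at i_min.
        self.interval = max(self.interval / self.factor, self.i_min)
```

Decreasing rather than resetting on events is the conservative choice discussed above: a single transient change does not immediately drive the node back to its fastest reporting rate.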
While this adaptive sampling functionality is optional (the DWSAN architecture operates without issue when it is not used), applications such as temperature and humidity management, or any application dealing with data of similar characteristics, can benefit significantly from such sensor data suppression techniques.

Evaluation
In this section we present the performance benefits of the DWSAN architecture, along with the wireless traffic reductions introduced by our proposed adaptive sampling scheme.
We start by presenting our evaluations for the DWSAN system using both Matlab simulations and data collected from real hardware-based experiments. While we agree that simulations only partially capture the effect of real wireless conditions, we utilize simulations for experiments that require large-scale deployments. Matlab simulations were done in Simulink, mimicking network configurations similar to the target deployments discussed later.
For our evaluations we focus on three different performance metrics. Specifically, (1) we analyze the latency required to make an actuation decision at the actuator after a sensing sample is sent, (2) we measure the amount of wireless traffic generated in the network with various data reporting configurations, and finally, (3) we measure the scalability of the system in terms of packet reception ratio as the number of supported nodes grows in the network.

4.1. Sensing-to-Actuation Response Latency. As mentioned in Section 2.3, the actuator's response latency to sensing measurements can be a critical factor for many applications. Using simulations, we show in Figure 7 the latency performance of the DWSAN network compared to a centralized networking approach.
We configured a 100-node network in a 100 × 100 meter grid (communication range set to 20 meters) while keeping 10% of the nodes as actuators, with the rest being sensor nodes. Node locations were set randomly, and the transmission delay (the actual time to deliver a packet over the wireless medium) was set to ∼3 ms (assuming a 100-byte packet on a 250 kbps link). Furthermore, since different applications require different per-hop packet processing, we varied the processing latency of each packet among 10, 20, and 40 ms. The gateway was set at the center of the random topology, which is the best-case scenario for the centralized scheme. Operationally, sensor nodes send their measurements to the decision-making device (the associated actuator in DWSAN and the gateway in the centralized scheme) to measure the latency required in making actuation decisions. For fair comparison, we neglect random channel loss; even when considered, it would not affect the relative performance of the two schemes.
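Under this setup, the sensing-to-decision latency reduces to a simple per-hop cost model, sketched below. The transmission delay and processing latency match the simulation parameters above, but the hop counts are hypothetical illustrations (not the simulated topologies), chosen only to show why a short path to a clustered actuator beats routing through the gateway.

```python
def actuation_latency(hops, proc_ms, tx_ms=3.0):
    """End-to-end sensing-to-decision latency (ms) for a path of `hops`
    links: each hop costs one transmission plus per-node processing."""
    return hops * (tx_ms + proc_ms)

# Illustrative hop counts: a sensor reaches its clustered actuator in
# 2 hops (DWSAN) vs. the central gateway in 5 hops (centralized).
dwsan = actuation_latency(hops=2, proc_ms=20)      # 2 * (3 + 20) = 46 ms
central = actuation_latency(hops=5, proc_ms=20)    # 5 * (3 + 20) = 115 ms
```

Because both the transmission and processing costs scale with the hop count, the gap between the two schemes widens as per-node processing latency grows, which matches the trend reported below.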
Notice in Figure 7, where we plot the mean and standard deviation over 100 different random topologies, that DWSAN outperforms the centralized scheme by more than twofold. This trend grows with increasing per-node processing latency, suggesting that DWSAN delivers its packets to actuators effectively and that traversing a smaller number of hops reduces the decision-making latency. Note that DWSAN may encounter an increased level of overhead on a per-hop basis. Nevertheless, if we compare the 10 ms case for the centralized scheme with the 20 ms case for DWSAN, DWSAN still shows better performance.

Analyzing the Required Wireless Packet Traffic.
Another important aspect is usage of the wireless medium. With fewer packets in the air, the network will eventually be able to scale to more nodes and gain a finer-grained view of the target environment. For this purpose, using the same environment as above, we plot in Figure 8 the average number of sensor samples (packets) sent in the network, computed over 100 tests with a 60-minute (simulated) duration each. In this experiment, however, we vary the number of actuators in the network. Specifically, we test cases where 10, 15, 20, and 25% of the nodes in the network are actuators and the others are sensors. Sensors send messages at an interval of 1 packet per second (pps), while actuators report their actuation summaries to the server at 0.1 pps.
In Figure 8(a) we keep the location of the gateway at the center of the topology as before. Again, this is the most ideal case for the centralized scheme since it minimizes the path length for the packets. Even so, notice that DWSAN outperforms the centralized scheme slightly. This trend continues even when the number of actuators in the network increases. This is surprising given that the increased number of actuators should further reduce the path length and thus achieve a significantly lower packet transmission count for DWSAN. To explain this, we note that the single-hop communication radius was set to 20 meters while the nodes were deployed in a 100 × 100 meter area. Therefore, the field was relatively saturated, and an increasing number of actuators has only minimal impact in reducing the number of packet transmissions.
In Figure 8(b) we now vary the location of the master (gateway) node randomly. As expected, compared to Figure 8(a), the number of packet transmissions for the centralized scheme increases by ∼40%, whereas the performance of DWSAN is not significantly impacted. Overall, these results suggest that the DWSAN architecture can effectively minimize packet transmissions (and thus channel saturation).

Impact of Adaptive Sampling.
We now expand this topic and introduce the impact of the adaptive sampling scheme (cf. Section 3.2) in reducing packet transmissions. We note that a preliminary version of the scheme below has been presented as part of reference [18]. For our evaluations, we perform a per-second measurement of temperature and humidity indoors for a week using the hardware introduced in Section 3.1.
We plot in Figure 9 the percentage of sample points required to reconstruct the original per-second-scale measurement when applying the adaptive sampling scheme, with and without the trickle timer reports, for different Δ_TH. Quantitatively speaking, the original sample set consisted of ∼520 K samples for each sensor type. Using the simple adaptive sampling scheme we were able to represent the entire data stream using as little as 0.019% of the original data (e.g., 100 samples) for the temperature data when Δ_TH = 1.0°C. We can see a similar trend for humidity.
Given the high compression ratio, the next question is how accurate this compressed measurement is compared to the original high-rate data. For this, Figure 10 compares the mean error of our scheme's summarized reports with a periodic scheme that uses the same number of reports but evenly distributed across the entire data set (Periodic: Adaptive). In other words, this second method represents a case where the samples were taken at a periodic interval, with the spacing between samples computed from the number of samples the adaptive scheme produced. For both schemes, at a given time, the system takes the most recent sample as the current temperature or humidity value. We compare the contextual values of both schemes against the actual fine-grained data collected at 1 sample per second. By comparing Periodic: Adaptive against Adaptive and Periodic: Adaptive + Timer against Adaptive + Timer in Figure 10, we notice that the nonperiodic schemes (e.g., our proposed schemes) show lower errors compared to the original data set. This is mainly due to our proposed scheme effectively detecting regions where the measurements change dynamically.
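The reconstruction rule used by both schemes (hold the most recent report) and the error comparison can be sketched as follows. The data stream and report positions are toy values, not the measured traces from Figure 10; they merely illustrate why reports concentrated in the dynamic region yield a lower error than the same number of evenly spaced reports.

```python
def reconstruct(reports, horizon):
    """Zero-order hold: at each time step, the current value is the most
    recent report. `reports` maps a time index to a reported value."""
    value, out = None, []
    for t in range(horizon):
        if t in reports:
            value = reports[t]
        out.append(value)
    return out

def mean_abs_error(truth, estimate):
    """Mean absolute error over time steps with a defined estimate."""
    pairs = [(a, b) for a, b in zip(truth, estimate) if b is not None]
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Toy ground-truth trace with one dynamic region (indices 3-4):
truth = [20.0, 20.0, 20.1, 21.0, 22.0, 22.1, 22.1, 22.0]
# Adaptive: reports cluster where the data changes.
adaptive = {0: 20.0, 3: 21.0, 4: 22.0}
# Periodic: the same number of reports, evenly spaced.
periodic = {0: 20.0, 3: 21.0, 6: 22.1}
```

With these toy inputs, the adaptive report placement reconstructs the trace with a noticeably lower mean absolute error than the evenly spaced placement, mirroring the qualitative result in Figure 10.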
Overall, as Figure 11 shows, our scheme effectively tracks and reports measurements (mostly) during the dynamic regions, thus, significantly increasing its accuracy while reducing the amount of traffic on the wireless medium.

Network Scalability.
Lastly, we examine the scalability of the system in terms of the packet reception ratio (PRR) observed at the gateway as the number of nodes in the network increases. In the same geographical setting as the other simulations, we increase the number of nodes to as many as 400, while configuring 10% of these nodes as actuator nodes. We plot the mean PRR and the standard deviation computed over 100 independent runs in Figure 12. We take Δ_TH = 0.4 and set the minimum trickle-timer value to 500 ms with a maximum reporting interval of 60 min. Notice in Figure 12(a) that, with a data rate of 1 pps, both DWSAN and the centralized approach fail to provide quality service with >200 nodes. Nevertheless, by effectively reducing the packet transmissions, DWSAN achieves a higher PRR than the centralized approach. The difference in PRR becomes more prominent in Figure 12(b), where we use a lower data rate of 0.4 pps at each node. In this case, DWSAN provides a PRR above 90% in all cases and its performance degrades gracefully, while the PRR of the centralized scheme degrades severely as the number of nodes exceeds 100.
To summarize our evaluations: the distributed and multitiered nature of the DWSAN system architecture is effective in reducing the hop count towards the decision-making device, which reduces the sensing-to-actuation latency and minimizes the number of packets transmitted in the wireless medium. This naturally leads to supporting more wireless nodes in the same geographical area compared to the centralized approach.

4.5. Real-World Deployments.
The DWSAN system was deployed as part of two pilot studies. The first study, illustrated in Figure 13, was conducted at our research institute. Its purpose was to evaluate the effectiveness of our proposed distributed and multitiered network while also putting focus on the application side of the system. Specifically, we selected autonomous lighting as the application for our system, where heterogeneous sensors, such as passive infrared (PIR), microphone, and illumination sensors, were used to determine the activity states in a meeting room environment [19]. This deployment consisted of eight sensing units connected to six actuation (LED lighting) units. Despite the small system configuration, where all nodes were physically connected via a single hop, we noticed that the proposed DWSAN system architecture reduced the number of packets transmitted in the air by ∼50% compared to an IETF-RPL-based single-tier network architecture [12]. Deeper investigation of the traces revealed that ∼10 of these 50 percentage points came from the network architectural changes, while the adaptive sampling contributed the other 40.
Based on the preliminary experiences from the internal testbed, we then performed a second pilot study in a 13-story commercial building in the Gangnam District of Seoul, South Korea. Our deployments, again designed for a smart lighting application as Figure 14 shows, covered two floors and included 20 sensing units (a mixture of PIR, microphone, illumination, and power-usage monitoring sensors) with 14 LED dimmer actuation units. Compared to the internal testbed in Figure 13, this deployment was in an environment with more frequent human activity, and not all nodes were reliably connected via single-hop connections. As a result, the impact of the multitiered and distributed decision-making process of DWSAN was even more pronounced. Specifically, we observed a ∼65% reduction in network traffic from using the DWSAN architecture in this environment.

Related Work
Finally, we introduce a few previous works that deal with the efficient design of wireless sensor and actuator networks (WSANs). We describe the contributions of these previous works and also discuss how they differ from the contributions made in this work with the DWSAN architecture.
In [7], Li discusses the concept of WSANs and a prototype of a light sensing and actuating application for a home environment. The paper also introduces the benefits of maintaining a layered networking approach, which is similar to the multitiered approach introduced in this work. Nevertheless, the work by Li, being one of the first steps in introducing actuation to sensing systems, does not evaluate the performance of the system at a large scale. In contrast, we set up an extensive set of simulations to show the effectiveness of maintaining a distributed and multitiered WSAN.
Cañete et al. apply a service-oriented approach to WSAN system design [20]. Their work introduces interesting framework and middleware architectures with unique and useful characteristics. Nevertheless, compared to our work, the authors focus mostly on introducing the design of the system framework, while we perform experiments that quantitatively showcase the benefits of using DWSANs.
In the work by Wu and Rowe [21], the authors introduce SAN-Logic, a lightweight logic-based programming paradigm, to enable the dynamic programmability and configuration of WSANs. The authors show that the use of this logic-based WSAN design can result in optimized network architectures that reduce the packet transmissions in the network. While this is an attractive approach and can be applied to various scenarios, compared to our proposed DWSAN, the resulting network is not standard-compliant nor can it easily and effectively account for changes in the topology or network environment characteristics.
Lastly, most similar to our work is that presented by Boukerche et al. [6]. Their work introduces the concept of centralized and decentralized networks for WSANs and discusses the benefits of the two approaches. To construct the decentralized network, the authors introduce the Periodic, Event-driven, and Query (PEQ) based protocol, which essentially ties geographically close sensor nodes to the nearest actuator node. The downside of this approach, however, is that the authors do not discuss how sensors with different contextual information would interact. In our RPL-based approach discussed in Section 3.2, we account for heterogeneous sensor types by making the format of the objective function flexible, not to mention the standard compliance with other RPL networks. Since the increase in sensing modalities and the interaction with various standard-compliant systems is unavoidable as WSANs gain more and more interest, the DWSAN architecture introduces a more advanced and expandable system architecture.

Summary
In this work we introduce the DWSAN system architecture designed for various indoor sensing and actuation applications. The DWSAN architecture exploits its distributed and multitiered characteristics to form a networked system that effectively reduces the packet transmission count and the sensing-to-actuation latency, while maximizing the scalability of the network. We introduce the software and hardware components that we designed for the DWSAN system and show, using both simulations and real data traces from our custom hardware, that DWSAN outperforms a centralized networking approach, which is still widely used in various monitoring and actuation applications. We believe that this study is an effort to diversify our thoughts on wireless sensing system architectures and to show quantitatively that, as new capabilities are added to wireless systems, a new perspective on network architecture design is required.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Introduction
The software installed in an automotive electronic control unit (ECU) has increased significantly in size and complexity. This automotive software is responsible for features directly connected to the user's safety, such as the advanced driver assistance system (ADAS) or adaptive cruise control (ACC). Automotive software is developed using the V-model development process, with corresponding verification methods at each stage. The verification methods are separated into model-, software-, processor-, and hardware-in-the-loop (MiL, SiL, PiL, and HiL) tests, depending on the integration level of each stage. Among them, the HiL test checks whether the functions meet the requirements through the interaction between the software mounted on the completely developed ECU and the peripheral hardware [1].
In the HiL test, the HiL simulator provides input to the target hardware, which is treated as a black box, and observes the corresponding results [2]. When faults are detected in the test results, the developers proceed to debugging. Typically, hardware debugging proceeds by using an interface such as the Joint Test Action Group (JTAG) interface to monitor the internal behavior of the target. However, there are some limitations to obtaining debugging information in the HiL test environment. First, when performing the test with the completed ECU, it is hard to modify the hardware when debug interfaces, such as JTAG and the background debug mode (BDM), do not exist or are not exposed. Furthermore, arbitrarily stopping the system, for example, to obtain debugging information, is limited. To interrupt an HiL test in progress at any time, both the ECU and the HiL simulator must be stopped at the same moment. However, the HiL simulator is configured independently from the ECU; thus, there is a limit to operating the two simultaneously at the system-clock level [3].
Meanwhile, the memory information collected periodically during program execution can be utilized as debugging information. Embedded systems such as the ECU execute the actions defined in a program by loading data from memory and storing the updated data. These operations are performed using memory-interface instructions defined in the program [4,5]. In other words, observing logs of memory use can be useful for understanding the program's operation. However, since the information generated regularly in memory is large, techniques for transmitting the data are required. For example, over 3 MB of data is generated per second when 32 KB of data is produced every 10 ms. To provide the data generated during the HiL test to the developer, the following challenges must be considered. First, methods are needed for overcoming the network bandwidth limitation. FlexRay, the controller area network (CAN), and the local interconnect network (LIN) can be utilized to transfer the large amount of memory data generated inside the ECU during the HiL test to the outside [6]. Among them, FlexRay, which is not commonly used in the ECU, is unsuitable for general use. The LIN cannot be used for large data transmissions because of its low bandwidth of about 20 Kbps [7]. Therefore, the CAN protocol is the appropriate choice for collecting massive memory data [8]. However, since the CAN's maximum transfer rate is 1 Mbps, the amount of data that can be transmitted in the above environment is 1.25 KB per 10 ms; that is, 96% of the data is lost. An additional technique is required to overcome this limitation.
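The arithmetic behind these figures is worth making explicit. The sketch below simply reproduces the paper's example numbers: 32 KB generated every 10 ms against a CAN bus running at its 1 Mbps maximum raw rate.

```python
# Memory data generated in the ECU: 32 KB every 10 ms.
generated = 32 * 1024                  # bytes per 10 ms
per_second = generated * 100           # 3,276,800 bytes/s, i.e., >3 MB/s

# CAN at its 1 Mbps maximum can move at most 125,000 bytes/s,
# i.e., 1.25 KB (1250 bytes) per 10 ms cycle.
can_per_cycle = 1_000_000 // 8 // 100

lost = 1 - can_per_cycle / generated   # fraction that cannot fit on the bus
print(round(lost * 100))               # prints 96 (% of the data is lost)
```

This is the raw-bandwidth upper bound; frame overhead (identifier, CRC, stuffing bits) makes the real payload rate lower still, so the 96% loss figure is conservative.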
Attempts have been made to overcome these limitations when transferring large amounts of data over CAN by modifying the CAN protocol or by compressing or fragmenting the data. However, the existing methods are ill-suited to the large data transmissions required in the HiL test environment. First of all, techniques that modify the CAN protocol require a revision of the communication driver to recognize the modified protocol [9][10][11]. Secondly, techniques that compress the data can interrupt the existing operation because they place an extra load on the processor. Moreover, when the compression ratio does not satisfy the requirement, the uncompressed surplus can be lost or communication can be delayed [12][13][14][15][16]. Lastly, with techniques that fragment the data, since the ECU has a limited buffer size, previously generated but unsaved data can be lost [17].
Meanwhile, by using the transfer technique in Figure 1(a), a large amount of memory data can be transmitted outside of the ECU over CAN by repeating the simulation. Because safety-critical automotive software requires strict real-time operation, the variables for a specific function always behave deterministically [18]. Using this characteristic, the data generated in each separate region is sent by repeating the simulation. The size of each memory region is determined by considering the memory size, the communication environment, and the ECU operation period. The transferred data is stored in the database along with its order information. Using this technique, large amounts of data can be sent outside without requiring hardware modifications or interfering with existing operations.
When providing the memory data collected from the ECU to the developer and tester, it is more effective to provide a visualized image than text [19,20]. When the collected data is provided as text, the developer must check the values collected from the 32-KB memory generated every 10 ms. However, by providing an image, the developer can observe the updates of each memory address on a 256 × 128-entry image. Based on the information in the image, the developer can intuitively select a focus area. The collected memory data can be visualized as shown in Figure 1(b). The data collected periodically during the simulation, from t_0 to t_n, is stored in a database. The memory-update information at t_1 can be acquired by comparing it with the data generated at t_0; the memory-update information at t_2 to t_n can be acquired similarly. By visualizing the acquired memory-update information, it is possible to monitor the operations in memory via images during program execution. In addition, it is possible to shorten the debugging time by eliminating unnecessary information that is not related to the operations. Therefore, in this paper, we propose techniques for collecting the large amount of memory data generated periodically in the ECU during the HiL test and visualizing the textual data effectively. First, we segment the memory area into a size that can be transferred per unit time, considering the size of the memory and the communication environment. Then, we repeat the simulation for each segmented area to collect the data generated there. Lastly, we acquire the memory-update information from the collected data to visualize the update information in each unit time and the cumulative update frequency at each address. By using our proposed method, massive data generated by the ECU can be transmitted to an external database. The collected data can help developers understand the defective behavior of the target.
The rest of this paper is organized as follows. Section 2 explains the limitations to acquiring debugging information in the general HiL test environment and existing methods to overcome the CAN bandwidth. In Section 3, methods to collect memory data using the data-cascading method and the visualization of the collected data are explained. In Section 4, modules to process the proposed methods are described. Section 5 presents the implementation of the methods and the verification of the collected data. The conclusion is presented in Section 6.

Related Work
In this section, the automotive HiL test, the limitations that arise when acquiring debug information in an environment with restricted hardware modifications, and the limitations in utilizing system resources are described. Then, to overcome the transmission limit of the CAN protocol, conventional methods for transferring a large volume of data and their limitations when applied in the HiL test environment are explained.

Hardware-in-the-Loop (HiL) Testing.
Automotive software is developed in accordance with the V-model development process and has a variety of verification methods, depending on the level of system integration at each stage. Among them, the HiL test is responsible for testing the interaction between the software mounted on the fully developed ECU and the peripheral hardware. The reasons for performing the HiL test are as follows. First, by building a simulation environment, it is possible to reduce the costs incurred in initially establishing and then maintaining a real-car environment [21]. Furthermore, improving the test coverage, increasing the test reproducibility, and shortening the test time can be accomplished through test automation [22].
In general, the HiL test environment consists of a host PC, the HiL simulator, and the ECU. The host PC transfers the test scripts to the HiL simulator; they include use scenarios for running the completely developed software in an automobile according to the testing purpose. The HiL simulator receives the test script from the host PC via a high-speed network such as Ethernet and delivers the test inputs stated in the script to the ECU via I/O or communication interfaces. Lastly, the system under test is composed of the ECU and the peripheral hardware. The ECU in the system receives the inputs based on the test script, performs the operation according to the inputs, and delivers the output to the HiL simulator to determine the presence of faults.
However, obtaining debug information on the faults that occur has the following limitations. Firstly, it is impossible to use common hardware debuggers, such as JTAG or BDM, when the interfaces are either absent or not exposed to the outside [22]. As the testing process is performed on the completely developed ECU, modifications for appending a debugging interface are prohibited. Even when debugging ports are available, the test cannot be stopped arbitrarily to collect data, because the test script encodes a use scenario. To suspend the HiL test, not only the ECU but also the HiL simulator must be stopped at the exact same time. However, the HiL simulator, unlike an in-circuit emulator (ICE) or a software debugger, is configured independently of the ECU; thus, simultaneously controlling both systems at the system-clock level is limited. In addition, utilizing the source code to detect the behavior causing the fault is restricted, because the testing target is treated as a black box.
Meanwhile, embedded systems such as ECUs use memory-interface commands, such as MOVE, LOAD, or STORE, to acquire data stored in the memory for processing the operation and to store the data into memory, for example, RAM. In other words, since RAM is utilized to store the data changed during the operation stated in the test, periodically monitoring the data updates in memory can be used to trace the behavior of the program processed inside the ECU. Therefore, by monitoring the memory of the ECU during the program execution in the HiL test environment, it is possible to acquire debugging information related to the faults [4,5].

Transferring Large Automotive Data Using the CAN Bus.
The CAN (Controller Area Network) is a protocol that is commonly used in all ECUs, with a transmission rate from 25 Kbps up to 1 Mbps. However, such a transfer rate is optimized for sharing small but critical values, such as RPM (revolutions per minute) or temperature values [23]. Thus, beyond the state information generated during conventional operation, another approach is needed to overcome the transmission limitations when transferring the large amount of additional data also generated during operation, for example, debugging information, to the outside. The following techniques have been proposed for transferring big data over CAN.
Firstly, methods for improving the CAN protocol's bandwidth have been proposed. The CAN+ protocol increases the amount of data that can be sent in a message by setting a high transfer rate in the gray zone that exists between the regions for synchronizing and sampling the data [9,10]. FastCAN overcomes the transfer limits by adding an extra bus to the network, which originally consisted of a single bus [11]. Further, it mediates messages via a new medium-access technique, allowing each node to access the bus efficiently. Although both approaches improve the transfer rate by 16 times or more, they cannot be utilized to transfer a large amount of data to the outside in the HiL test environment. When any field in the protocol other than the data field is modified, an overall conversion is required so that the communication driver can recognize the data in that area. However, since the HiL test proceeds on the assumption that the manufactured ECU is complete, hardware modifications are not appropriate.
Secondly, reducing the large volume of data by using a compression algorithm has been proposed. Wu et al. presented an algorithm that performs delta compression to reduce the amount of data when it exceeds 5 bits [12]. Additionally, they suggested another approach to enhance the data compression ratio by rearranging different signals according to their characteristics [13]. In the research of Miucic et al. [14], additional control bits are used to compare the original data to the compressed data in order to send the smaller one. To enhance the algorithm proposed by Miucic et al., Kelkar suggested a method to diversify the parameters by utilizing the fields that represent the compression information [15]. Furthermore, computation can be reduced by positioning the compression information in front of the compressed data, unlike the above [16]. However, applying such compression algorithms has the following restrictions in the HiL test environment. First, when the compression ratio does not satisfy the communication-bandwidth requirement, the surplus data can be lost. Furthermore, the existing test behavior can be interrupted, since the operations for compressing data place an additional burden on the processor.
Finally, a technique for dividing the data into a size that can be stored in a message has been proposed. The data is divided into segments that can be stored in a message, including the sequence information; then, each segment is transmitted [17]. Using the eight-byte data field in the CAN protocol, the data is stored in six bytes with two bytes of sequence information. In addition, the reserved bits in the control field are used to indicate fragmentation. However, this is not applicable to the HiL test, where a large amount of data is generated periodically inside the ECU. Embedded systems, such as ECUs, have limited storage buffers. Therefore, when monitoring the ECU while the HiL test is in progress and a large amount of memory data is repeatedly generated, there is a possibility that the data will overflow the storage buffer and be lost. Furthermore, when using reserved bits for sequence information, the communication driver must be changed to interpret the modified message, and this cannot be applied to the ECU in the HiL test.
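A minimal sketch of the fragmentation idea in [17] follows: six data bytes plus two bytes of sequence information per eight-byte CAN data field. The big-endian layout of the sequence number is our assumption, not the cited paper's exact format.

```python
def fragment(data: bytes) -> list[bytes]:
    """Split data into 8-byte CAN payloads: a 2-byte sequence number
    followed by up to 6 data bytes, per the scheme of [17]."""
    return [seq.to_bytes(2, "big") + data[off:off + 6]
            for seq, off in enumerate(range(0, len(data), 6))]
```

A 32 KB snapshot would thus need ⌈32768/6⌉ = 5462 messages, which illustrates why a limited on-ECU buffer overflows when data is produced faster than the bus can drain it.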

Collecting Memory Data for Fault Visualization
This section explains the method for collecting large amounts of memory data periodically by using a data-cascading method in the ECU while executing the HiL test and the algorithm to calculate the memory-update information using the collected data. Moreover, methods for a developer to visualize the update information are suggested.

Data-Cascading Method for Collecting Extensive Memory Data.
To transfer the large amount of memory data generated by the ECU in the HiL test, limitations such as restrictions on utilizing system resources or modifying the hardware must be overcome. In the data-cascading method, extensive data was successfully transmitted by running a simulation repeatedly with the same test scenario [24]. Figure 2 illustrates the process of collecting the memory data generated inside the ECU using this method. For example, taking the operation cycle of the system's main task as 10 ms and the memory to monitor periodically as the combined size of regions A, B, and C, the following process divides the area into pieces of the size that can be transferred in each cycle. First, the number of messages that can be sent in one cycle is calculated by dividing the operation cycle by the time required to send one message successfully. In the example, three messages can be sent in one operation cycle (A-1, A-2, and A-3). By considering the bus occupancy of the existing operation, it is possible to calculate the number of messages that can be sent without affecting the execution. The size of the area that can be transferred per operation cycle is then calculated by multiplying this message count by the size of the data storable in a message, excluding the sequence information.
In Figure 2, the area to monitor is divided into three regions, and the amount of data that can be sent in one cycle is A. Therefore, to collect the data generated in all of the regions, the simulation must be repeated three times. In the first simulation, the data corresponding to region A is sent, and in the second simulation, the data corresponding to region B is sent. This process is repeated until all the data regions are sent to the database.
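The division into regions, and hence the number of simulation repetitions, can be sketched as follows. The symbol names (t_task, t_msg, s_req, s_msg, occ) are our reconstruction of the quantities the paper defines; the function name is ours.

```python
import math

def plan_cascade(t_task, t_msg, s_req, s_msg, occ):
    """Return (messages per cycle, bytes transferable per cycle,
    number of regions = simulation repetitions) for data cascading."""
    n = int(t_task // t_msg)           # messages sendable per task cycle
    usable = n * s_msg * (1 - occ)     # bytes transferable per cycle
    return n, usable, math.ceil(s_req / usable)
```

With a 10 ms cycle, ∼3.3 ms per message, 6 data bytes per message, and an idle bus, three messages (A-1 to A-3) fit per cycle, so a 54-byte monitored area splits into three regions and three simulation runs, matching Figure 2's A/B/C example; halving the available bus share doubles the number of runs.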
To collect the data periodically generated in the monitored area, as in Figure 2, the number of divided areas and the simulation repetitions can be calculated using the following information: (i) T_task: operation cycle of the system's main task; (ii) T_msg: minimum time to transfer a message; (iii) N: number of messages that can be sent in T_task (T_task/T_msg); (iv) S_req: size of data generated in every T_task; (v) S_msg: size of data stored in a message, excluding the order information; (vi) OCC: occupancy of the communication bus for the existing operation.
By considering the existing bus occupancy, the number of messages available for transfer in each cycle is N × (1 − OCC), so the amount of data that can be sent per cycle is N × S_msg × (1 − OCC). Finally, the number of area segments is calculated as ⌈S_req / (N × S_msg × (1 − OCC))⌉. Meanwhile, the size of the data field in the CAN data frame is eight bytes. To utilize it efficiently, it is necessary to define proper protocols. Figure 3 shows three protocol types; their detailed explanations follow.
Figure 2: Overall process of moving memory data from the ECU to an external database using a data-cascading algorithm.

(i) Data-Request Protocol (RP). The data collector in the host PC generates a data-request packet (RP packet) to send the transfer environment and the requested area to the transfer agent in the ECU. The information field is set to 1 to indicate that it is an RP packet. The remaining fields contain the total number of frames and the start and end addresses to monitor.
(ii) Cascading-Information Protocol (CIP). The transfer agent generates the cascading-information packet (CIP packet) to transfer the sequence information of the data currently transmitted. The information field is set to 1 to indicate that it is the CIP packet. The rest of the fields include the sequence information among the divided regions (offset) and the frame index of the frame to which the currently transmitted data belongs.
(iii) Data-Transfer Protocol (DP). The transfer agent generates a data-transfer packet (DP packet) to transfer the data with its sequence information within the divided regions. The information field is set to 0 to indicate that it is the DP packet. In the rest of the fields, six bytes of consecutive data are stored, and their starting address is stored in the base address field.
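The three payloads might be packed as in the hedged sketch below. The exact field widths and byte order are in Figure 3 and are not reproduced here, so every width in this sketch is an assumption; in particular, the DP base address is treated as a one-byte region-relative offset, and how RP and CIP (both with the information flag set to 1) are told apart, presumably via further bits or the CAN identifier, is left out.

```python
import struct

# Hedged sketch of the Figure 3 payloads; all field widths are assumed.

def pack_rp(total_frames: int, start: int, end: int) -> bytes:
    # info flag = 1, frame count, 3-byte start and end addresses
    return bytes([1, total_frames]) + start.to_bytes(3, "big") \
        + end.to_bytes(3, "big")

def pack_cip(offset: int, frame_index: int) -> bytes:
    # info flag = 1, region offset, 2-byte frame index, 4 pad bytes
    return struct.pack(">BBH4x", 1, offset, frame_index)

def pack_dp(base: int, data6: bytes) -> bytes:
    # info flag = 0, 1-byte base address (region-relative, an
    # assumption), then the 6 consecutive data bytes
    assert len(data6) == 6
    return bytes([0, base]) + data6
```

Each function fills exactly the eight-byte CAN data field, which is the one constraint the paper fixes.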

Computing the Memory-Update Frequency for Fault Visualization.
Using the suggested method, the memory data generated in the ECU is collected to acquire the memory-update information. The memory data (S_req bytes) is collected every T_task during the execution time. Equations (2) and (3) represent the address range and the frame index. The frame, defined in (4), represents the set of data generated in each cycle:

F_k = {V_{a,k} | a is an address in the monitored range}, where k is a frame index and V_{a,k} is the value stored in address a at the k-th frame.

The memory-update information is calculated as in (5). By comparing two consecutive frames, the equation assigns 1 when the data has changed and 0 when the data has not changed or it is the 0th frame:

U_{a,k} = 1 if k > 0 and V_{a,k} ≠ V_{a,k−1}, and U_{a,k} = 0 otherwise.

Mobile Information Systems

Using the equation above, it is possible to obtain the memory-update information of the k-th frame. With this information, it is possible to calculate the cumulative update count C_a for each memory address during execution.

3.3. Visualizing the Memory-Update Frequency.
The collected memory data is stored in frames, as shown in Figure 4(a). Figure 4(a) represents the raw data V_{a,0}–V_{a,3} stored in the database. By processing the raw data, U_{a,1}–U_{a,3} and C_a can be obtained for visualization, as in Figures 4(b) and 4(c).
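The per-frame comparison of (5) and the cumulative per-address count can be sketched directly. This is a Python sketch with our own names; frames are simply lists of byte values of equal length.

```python
def update_info(frames):
    """u[k][a] = 1 iff address a changed between frames k-1 and k
    (always 0 for the 0th frame), as in (5); also returns the
    cumulative update count per address."""
    n = len(frames[0])
    u = [[0] * n]                 # 0th frame: no update by definition
    for k in range(1, len(frames)):
        u.append([int(frames[k][a] != frames[k - 1][a]) for a in range(n)])
    cum = [sum(row[a] for row in u) for a in range(n)]
    return u, cum
```

The first return value drives the per-frame images of Figure 4(b); the second drives the total-frequency image of Figure 4(c).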

Visualizing U_{a,k} at Each Time (Visualizing the Update Information in Each Frame).
U_{a,k} is visualized with the data acquired in (5). Except for the 0th frame, one image is generated per frame, and each image contains information such as U_{a,1}–U_{a,3} in Figure 4(b). Each image shows whether an update occurred at each address. Addresses where no update occurred, or where updates occurred randomly, can be marked differently to reduce the regions to monitor.

Visualizing C_a (Visualizing the Total Update Frequency).
C_a is visualized in a single image. As shown in Figure 4(c), the information is provided as an image that conveys the approximate update frequency by using different colors to represent the number of updates. With this image, it is possible to reduce the time spent debugging by preferentially selecting the memory regions to monitor.
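As an illustration of the total-frequency image, the sketch below lays per-address cumulative counts onto a 256 × 128 grid (32768 entries, matching the 32-KB example) and buckets each count into one of four color levels; the bucketing rule and function name are our assumptions.

```python
def frequency_image(counts, width=256, height=128):
    """Lay per-address cumulative update counts out on a width x height
    grid, bucketing each count into color levels 0-3."""
    assert len(counts) <= width * height
    peak = max(counts) or 1            # avoid division by zero
    img = [[0] * width for _ in range(height)]
    for addr, c in enumerate(counts):
        img[addr // width][addr % width] = min(3, 4 * c // peak)
    return img
```

A renderer would then map the levels 0-3 to the colors of Figure 4(c); frequently updated addresses stand out immediately, guiding which regions to monitor first.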

Design of Modified Architecture in the HiL Test Environment
To collect and visualize the data generated from the ECU while processing the HiL test, the following modules have been designed. Figure 5 depicts the modified architecture from the general HiL test environment. In the host PC, the data collector and visualizer are installed. In the ECU, a transfer agent is installed to perform the functions. The transfer agent in the ECU and the data collector in the host PC are connected to each other, using the CAN to exchange packets. The visualizer generates the images with the data from the database. A detailed explanation of each module follows.

Data Collector.
The data collector is installed in the host PC and provides a UI (user interface) to receive information from the user about the test environment and the memory area to monitor. In addition, it provides the following functions. First, it generates an RP packet based on the information requested by the user and sends the packet to the transfer agent. Second, it receives the DP and CIP packets from the transfer agent, parses them, and stores the data in the database. Finally, it acquires the memory-update information by comparing the received data with the previously stored data. The detailed algorithm to store the data from the received packets is depicted in Algorithm 1. The data collector receives the CIP packet and DP packet from the transfer agent and parses them to acquire the data and sequence information to store in the database. Firstly, the offset and the frame index are acquired by analyzing the received CIP packet (lines (3)-(4)). After obtaining the sequence information, the data acquired from the DP packet is stored in the database. From the DP packet, six consecutive bytes of data and their starting address are acquired (lines (6)-(8)).

Algorithm 1: Storing the values V_{a,k} to V_{a+5,k} from received CIP and DP packets into the database.
Algorithm 2 shows an algorithm for calculating the update information from the received data. First of all, when the data collector receives the CIP packet, it saves the value of the frame index into k by parsing the packet (line (4)). Then, on receiving the following DP packets, it stores the consecutive six bytes of data and their starting address (line (6)). If the value of k equals 0, it assigns the value 0 to U_{a,0} through U_{a+5,0} (lines (7)-(8)). If the value of k is greater than 0, it compares the received data with the data in the buffer to check for differences. When there is a difference, it assigns 1 to the area where the update occurred (lines (11)-(17)).
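Taken together, Algorithms 1 and 2 amount to the following sketch; class and field names are ours, values are keyed by frame index and absolute address, and the update flag is computed against the previous frame as described.

```python
class DataCollector:
    """Sketch of the host-PC collector: parse CIP/DP packets, store the
    data, and mark per-address updates (cf. Algorithms 1 and 2)."""
    def __init__(self):
        self.db = {}        # (frame, address) -> value
        self.update = {}    # (frame, address) -> 0 or 1
        self.offset = 0     # region offset from the latest CIP packet
        self.frame = 0      # frame index from the latest CIP packet

    def on_cip(self, offset, frame_index):
        self.offset, self.frame = offset, frame_index

    def on_dp(self, base, data6):
        for i, v in enumerate(data6):       # six consecutive bytes
            addr = self.offset + base + i
            self.db[(self.frame, addr)] = v
            prev = self.db.get((self.frame - 1, addr))
            self.update[(self.frame, addr)] = \
                0 if self.frame == 0 else int(v != prev)
```

In the cascading scheme, frame k − 1 of the same region was stored during an earlier cycle of the same run, so the comparison against `prev` mirrors the buffer comparison of Algorithm 2.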

Transfer Agent.
The transfer agent is implemented as independent software in the ECU; it exchanges packets with the data collector in the host PC over the CAN connection. It receives the RP packet from the data collector to obtain the range of memory to monitor and the total frame count. After obtaining the information, it captures the data according to the request, produces the DP packet, and sends it to the data collector. In addition, it sends a CIP packet to transmit the sequence information of the transmitting data. The algorithm for these functions is depicted in Algorithm 3.
The transfer agent receives the RP packet from the data collector with the environment information, such as T_task, P_msg, M_req, D_msg, and OCC. With this information, the transfer agent operates as follows. Firstly, it obtains the value of the frame index and the start and end addresses of the memory area to monitor from the RP packet (lines (3)-(5)). Then, it divides the memory according to the environment information. The number of regions to divide can be calculated as shown in Algorithm 4, which requires the values of T_task, P_msg, M_req, D_msg, and OCC (Algorithm 4, lines (3)-(4)). When the calculation finishes, it generates the CIP packet using the current frame index and the offset value, and the DP packets that include the above sequence information (lines (10)-(13)). The packets are sent to the data collector, and this process is repeated while increasing the offset and frame index.
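One plausible shape of the segment calculation is sketched below. The exact formula (7) is not reproduced in the text, so the bandwidth accounting here is only our reading of the stated inputs (T_task, P_msg, M_req, D_msg, OCC): only the idle share of the bus is usable, and one segment must fit within one task cycle.

```python
import math

def calculate_segments(t_task_ms, p_msg_ms, m_req_bytes, d_msg_bytes, occ):
    """Number of memory segments so that one segment fits in one task cycle.

    Only the idle share (1 - occ) of the CAN bus is available; each message
    takes p_msg_ms on the bus and carries d_msg_bytes of monitored data.
    """
    msgs_per_cycle = math.floor(t_task_ms * (1.0 - occ) / p_msg_ms)
    bytes_per_cycle = msgs_per_cycle * d_msg_bytes
    return math.ceil(m_req_bytes / bytes_per_cycle)
```

With T_task = 10 ms and M_req = 29,400 bytes as in the experiment, suitable values of P_msg and OCC reproduce segment counts of the same order as the 330 reported, but those two inputs are not stated explicitly in the text, so the numbers in this sketch are illustrative only.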

Visualizer.
The visualizer generates images to display the data stored in the database. Two types of images are generated in the visualizer. First, it generates a list of images containing the update information at each address in each frame. It uses the per-frame update occurrences calculated using (5) and generates one image per frame to check the update information in every cycle. Areas where no updates occurred or areas unrelated to the operation can be filtered out. Second, it creates an image containing the total memory-update count at each address. It generates one image using the totals calculated using (6). The image displays the update frequency in different colors to provide an intuitive understanding.
In the environment above, M_req is 29,400 bytes, which means that the amount of data generated in every T_task, which is 10 ms, is 29,400 bytes. Since CAN's maximal bandwidth is about 1 Mbps while the data generated is about 3 MB per second, this causes a loss of about 96% of the data in each cycle. The number of memory regions to divide was calculated as 330, using (7). Therefore, we repeated the simulation 330 times to collect the data generated in the entire region. The sets of monitored addresses and frames can be obtained using T_task and the start and end addresses of the requested region, as in (8) and (9).
A = {a | 0 ≤ a < 29,400; a is an address}. (8)

In the results shown in Figure 6(a), 400 frames of data, about 11,760,000 bytes, were collected successfully. About 85,956 updates occurred during the program execution; they were calculated using (5).
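The collection volume follows directly from the frame count and the size of the monitored region, a quick worked check:

```python
FRAMES = 400            # frames of data collected during the run
REGION_BYTES = 29_400   # M_req: size of the monitored region per frame

total_bytes = FRAMES * REGION_BYTES   # 11,760,000 bytes over the whole run
```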
To verify the collected data, we used the Trace 32 hardware debugger [25]. Trace 32 is a tool that controls and monitors the target through a standard debug interface, such as JTAG or BDM, equipped inside the processor. It is possible to collect the variables, registers, and OS information by suspending the program during execution. However, there is a limit to collecting memory data during the HiL test because this process must stop the execution during the simulation. Therefore, to utilize Trace 32 for verification, we stopped the execution randomly and captured the data stored in the stack, which is a part of memory. We compared the captured stack with the data collected by the proposed method.

On the left side of Figure 7(a), register R1 stores the value of the stack pointer, which is the starting address of the stack, and the right side of Figure 7(a) shows an image of the stack. Starting from address 40006BA4, data is stored on the stack every four bytes. Figure 7(b) is the set of data starting from 40006BA4 collected by the proposed method. We compared the data captured by Trace 32 and the data collected by our method and found that they were identical. This means that the data collected by our method can be reliably used.

Application.
To visualize the data collected by the method proposed in this paper, we used the MFC (Microsoft Foundation Class) framework and the VTK (Visualization Tool-Kit) library, which is open source based on OpenGL [26]. In addition, we utilized the GNU binary utilities to acquire information such as the stack or section, which can be generated by static analysis [27]. Using this information can reduce the area to check for debugging by filtering out unnecessary information, based on the symbol or section information mapped at each address and the memory-update frequency. Using the symbol information, we can distinguish the variables that are not necessary to understand the operation, such as the OS stack, for which the memory-update frequency is meaningless, or the buffer and timer, which are used dynamically [5]. Therefore, the debugging time can be shortened by reducing the fault candidates in accordance with the symbol information. With the processed data mentioned above, it is possible to create two image types using the visualization methods proposed in this paper.

Visualizing Total Update Frequency.
Figure 8 shows an image containing the total update frequency during the program execution for each address. Each value has a different color according to the number of updates. Addresses with a higher number of updates have a color close to purple and red, and those with a lower number of updates have a color close to green and yellow. The areas where no updates occurred are white. The candidates to check for debugging can be reduced by filtering out the variables that have no relevance to the operation or whose update frequency is meaningless.
In Figure 8, A is related to the stack used in the OS, B is the buffer area, and C is the counter and timer area. Each area has characteristics, such as having dynamic values or being temporarily used as buffers, and the stored data value has meaning while its update information is meaningless. Therefore, these areas can be represented as white to reduce the fault candidates. By using the image in Figure 8, developers can select the focus area based on the total update frequency related to the operation.

Visualizing Update Information in Each Frame.
The memory-update information in each frame is visualized in Figure 9. The left side of the figure is a list of images containing the memory-update information that occurred in every cycle of the system's main task during the execution. Each image in the list can be expanded to the right side of the figure. Gray means the address has no relevance to the operation or that no updates occurred during the execution. Red means an update occurred in the selected frame. White means an update occurred during execution, but not in the selected frame. With the list of images, a developer can understand the memory-update phenomenon during execution in the area of Figure 8 that the developer selected to focus on. Using this information, the developer can grasp when a fault occurred. The data stored in each frame is also provided in the form of a table.
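The three-state encoding (gray/red/white) amounts to a small classification rule per (address, frame) cell. A sketch follows; the function name and data layout are assumptions, not the paper's code.

```python
def cell_color(addr, frame, update_flags, ever_updated):
    """Color for one (address, frame) cell in the per-frame image list.

    update_flags[(addr, frame)] is the 0/1 update occurrence for that frame;
    ever_updated is the set of addresses updated at least once in the run.
    """
    if addr not in ever_updated:
        return "gray"    # unrelated to the operation / never updated
    if update_flags.get((addr, frame), 0) == 1:
        return "red"     # updated in the selected frame
    return "white"       # updated sometime, but not in this frame
```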
With the utilization above, developers have the following advantages for understanding the causes of faults generated during the HiL test. First, it is possible to catch both the time point when the fault occurred, using the memory-update information, and visual information, such as the combination of symbols causing the faults, in environments where source code use is restricted, such as black-box testing. In addition, since the size of the memory data collected periodically is large, it is more effective to access the cause of the faults intuitively using images such as Figures 8 and 9 rather than text.

Conclusion
In this paper, we proposed a method to collect large amounts of memory data generated inside the ECU by using CAN to monitor the behavior in the HiL test environment, where using a general debugger, utilizing additional system resources, and modifying the hardware are restricted. In the proposed method, we divided the memory into a size that could be sent during the operation cycle, considering the communication environment. Then, we repeated the simulation with the same test script to collect the entire data region, and the collected data was verified. In addition, we provided useful information to access the cause of faults by visualizing data obtained from the memory-update information.
A large amount of data could be transmitted using our method, which does not require hardware modifications or using a debugger and also does not affect the existing operation. Furthermore, in the HiL test environment, where using source code is limited, effective access to the fault was possible by utilizing the images containing the memory-update information. As a result, the debugging time was shortened compared to finding the faults using text.
For future work, we will dynamically consider the bus occupancy to fully utilize the bus to enhance the transmitting efficiency. With the enhanced efficiency, the number of simulation repetitions can be reduced. Moreover, a more intuitive understanding can be provided when the symbol and section information assigned to each memory address is provided together with the images.

Competing Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.

Introduction
The multicore architecture has penetrated the industry because of its apparent advantage in delivering higher performance. More recently, a heterogeneous variant of this architecture, ARM's big.LITTLE [1], has also found its way into the mainstream for various Internet-of-Things (IoT) applications as well as mobile devices. The popularity of big.LITTLE architectures stems from their better performance-power ratio and more efficient space utilization than their homogeneous counterparts [2]. These architectures consist of cores having different computing capacities but sharing the same instruction set architecture (ISA). They act as a way to conform to the variation witnessed in typical workload sets, thereby helping maintain performance while saving power through aversion of core underutilization. Besides underutilization, harboring a fast thread on a sluggish core would be penalized by significant power consumption, since it would take a long time for it to complete, as seen in [3]. The variation between core performance and energy footprint is the result of discrepancies between their features, such as cache size, frequency, issue width, in-order/out-of-order execution, and so on [4][5][6][7][8][9]. Though these multicore architectures seem quite advantageous, designing an application tailored to a multicore, let alone heterogeneous, architecture is difficult because the developer is usually exposed to the low-level details. Paper [10] proposes a developer-friendly interface to harness GPUs and other multicore architectures. As such, we believe much of the logic for harnessing these architectures should be pushed to the OS scheduler. In [11], the authors even went as far as proposing their own experimental heterogeneous programming system.
These heterogeneous architectures are a new trend; as such, they can render previously existing schedulers obsolete. The Completely Fair Scheduler (CFS) in Linux is one of them. CFS was introduced starting from Linux 2.6.23 as a replacement of the previous vanilla scheduler's SCHED_OTHER interactivity code [12]. CFS is widely chosen over other schedulers such as FIFO in that it averts the problem of process starvation and assures fairness. However, CFS does not account for core heterogeneity when making its scheduling decisions. This can bear far-reaching consequences in terms of the energy and performance that could potentially be saved and improved, respectively. As CFS is the default scheduler in Linux, our focus will be on its improvement in the arena of heterogeneous architectures.
This paper proposes the Fastest-Thread-Fastest-Core (FTFC) dynamic thread scheduling mechanism, which bases its mapping decision on the conformity of running threads' CPU utilization with the performance of available cores. It does this by dynamically and periodically measuring the CPU utilization of running threads, so that threads that exhibit high CPU utilization are assigned to cores that can deliver the high performance needed, while threads with low CPU utilization are assigned to low-performance cores. We measure the threads' CPU utilization based on their CPU time; that is, the amount of time a thread spends executing on the core, as opposed to, for example, waiting for input/output (I/O) operations or entering low-power (idle) mode. The FTFC scheduler does its mapping using a new scheme which we coined Binary Searched Mapping (BSM), which runs in O(n log m) time, where n is the number of cores and m is the number of core types.
This paper also introduces a new means of modeling heterogeneity based on the discrepancy between the performances of cores. We coined it the heterogeneity measure (HM), which describes how heterogeneous the system is. Using this measurement technique, we are able to consider a whole spectrum of heterogeneous configurations. Through our experimentation, the FTFC scheduler saves power by up to 2.22% and speeds up processing by up to 52.62%, compared to CFS.
The rest of the paper is organized as follows. Section 2 elaborates on related works. Section 3 presents our heterogeneity measure. Section 4 describes the proposed scheduler. Section 5 shows the experimental environment and results. Finally, Section 6 concludes.

Related Work
Regarding multicore scheduling, several studies are found in the literature. In [13], Koufaty et al. proposed a method of looking at the CPI stacks of the running applications to infer the core to which they should belong. The HASS technique, proposed by Shelepov et al. [14], sought efficient mapping by using program signatures embedded in the program binaries prior to execution. In [15], Chen and John used a Weighted Euclidean Distance (WED) measure between the desired resource demand of a program and the core configurations to map a thread to an optimal core. Both techniques rely on prior offline profiling of the programs.
Most previous work relied on analyzing programs into their constituent subtasks, formulating a constraint graph, and partitioning the graph to minimize the overall execution time [16] or the data transfer on buses [16,17]. Their approach requires microarchitecture-dependent performance/power data as a priori information for scheduling, whereas our approach does not. Chen and John [18] used fuzzy-logic-based program-core suitability computation. This method is inapplicable when the number of core characteristics increases, because the complexity of their algorithm grows exponentially as the number of characteristics increases. Our approach is totally oblivious to the number or type of characteristics, as it only concerns the core performance.
Gulati et al. [19] explored the benefits exhibited by dynamic mapping of tasks to cores of flexible design. Kumar et al. [2] proposed a dynamic program scheduling scheme based on sampled EDP in subsequent runs. This is a straightforward trial-and-error scheduling approach to find the appropriate match of core and thread. Becchi and Crowley [20] followed up on their work by using IPC ratios between two programs for program migrations. All these methods, as does ours, exploit intraprogram diversity and can adapt to program phase changes. In [21], a dynamic scheduling was proposed by tweaking core priority so that the OS load balancer does its job in a biased manner. This bias is formulated by identifying key metrics.
The work in [22] proposes a static heuristic method for heterogeneous processors based on the use of an acyclic precedence graph. The use of this heuristic, however, requires a priori knowledge of the characteristics of the workloads. One of the relatively recent papers that seek to address this mapping problem is by Liu et al. [23]. They mathematically formulated the problem as a 0-1 integer linear program (Integer LP) and also proposed an iterative O(n^2/m) heuristic algorithm for solving it. We were able to do it in an O(n log m) fashion.
We assume that each core executes a single thread at most, and that the variations of the workloads rather than the types of workloads are more important when choosing the type of benchmarks. In [24], the authors also refer to the variation of workload as the more important metric when designing their flexible multicore (FMC) architecture. Reference [25] advocates tailoring individual cores' properties to different subsets of application characteristics as a main idea behind their designs.

Heterogeneity Measure
With the introduction of heterogeneity to architectures, the need to tell not only the heterogeneity but also the degree of the heterogeneity itself became apparent. We formulated it as the standard deviation of the normalized performances of the individual cores. The performance of cores is once again modeled by the total CPU utilization obtained from running four of the PARSEC benchmarks serially on that core, core performance being directly proportional to the average CPU utilization. We define the heterogeneity measure as the standard deviation of P_i/P_max, where P_i is the i-th core's CPU performance and P_max is the maximum among the cores; for a two-core system, P_max is one of either P_1 or P_2, and a larger deviation indicates, for example, that one core is twice as fast as the other. HM can never be greater than 1; it approaches 1 only in the degenerate case where a single core is functioning. We use HM as an important metric to express the heterogeneity of the configurations we used throughout the experimentation.
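Under this definition, HM can be computed as the standard deviation of per-core performances normalized to the fastest core. The sketch below assumes the population (rather than sample) standard deviation, which the text does not specify.

```python
import statistics

def heterogeneity_measure(perfs):
    """Standard deviation of core performances normalized by the maximum."""
    p_max = max(perfs)
    normalized = [p / p_max for p in perfs]
    return statistics.pstdev(normalized)

# A homogeneous pair gives HM = 0; widening the performance gap raises HM.
assert heterogeneity_measure([100, 100]) == 0.0
```

For a two-core system where one core is twice as fast as the other, the normalized performances are 1.0 and 0.5, giving HM = 0.25 under this population-deviation reading.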

Fastest-Thread-Fastest-Core Scheduler
Unlike the heuristic mapping methods proposed in [23], our method does not need to model the power, as it bases its decisions on the understanding that power consumption depends on the independent variable of performance. As such, we focus on performance optimization and let the dependent variable, power, follow with its own optimization.
Before the FTFC scheduler takes over, six of the PARSEC benchmarks are launched and randomly assigned to one of the six available cores. Stage 1 of Figure 2 shows an example of how this is done using four cores and four threads; in the figure, one value signifies the normalized CPU performance and the other the throughput of the threads on their respective cores. The lifetimes of these threads are assigned arbitrarily, the maximum being 30 seconds and the minimum being 5 seconds. Each thread was assigned the same priority as the others. The scheduling process ends when all threads end their execution. CPU utilization was monitored by measuring the CPU time. FTFC is the same as CFS except that FTFC considers heterogeneity and does dynamic thread migrations. When the configuration is homogeneous, that is, HM = 0, FTFC is essentially CFS, as there will not be any thread migrations. In this way, FTFC can be seen as an extension of CFS.
Once all the threads are launched, the FTFC scheduler periodically monitors the CPU utilizations of the running threads. We set the time interval to 1 ms. For such granularity, previous work [26] has shown that cold-cache effects and context-migration costs can be kept to a minimum. Some of the considerations that should be taken when migrating between cores with private L1s and L2s and a shared L3 are remembering the predictor state, allowing L2-to-L2 data transfers from the inactive to the active core, and maintaining the tags of the various L2 caches coherent. Transferring registers and flushing a dirty private L1 data cache have a negligible performance overhead [26]. The FTFC scheduler can endure at most n migrations, where n is the number of cores. A diagrammatic illustration of the FTFC scheduler is shown in Figure 1.

Homogenization.
Once the CPU utilization of a thread is obtained, it is homogenized. This process is necessary as it helps avoid unfair comparison of CPU utilizations between threads running on small cores and those on bigger cores. "Small" and "big" describe the relative processing power of the cores. By homogenizing, we get the core-neutral CPU utilization value of the threads. It is done by multiplying the CPU utilization of a thread by the ratio of the normalized performance of the fastest core to that of the core it is running on (P_max/P_i = 1/P_i, since performances are normalized so that P_max = 1). The homogenizing process is thus expressed as H_j = U_j/P_i, where P_i is the normalized i-th core performance and U_j is the CPU utilization of the j-th thread. Stage 2 of Figure 2 shows the homogenizing process. We can see that thread 1 has the same core utilization as thread 4 in stage 1. After the threads are homogenized, thread 4 has a lower utilization per core than thread 1. This means thread 4 was using less of the performance that core 1 could potentially deliver. The homogenizing process is needed to avoid a static comparison of utilization values between cores, as shown in the scenario of thread 1 and thread 4. Homogenization gives fairness in the comparison of thread utilizations according to the relative performance of the cores.
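The homogenizing step amounts to scaling each thread's utilization by the ratio of the fastest core's normalized performance to the current core's (equal to 1/P once performances are normalized so the fastest core is 1.0). A minimal sketch, with assumed names:

```python
def homogenize(util, core_perf_norm):
    """Core-neutral utilization: scale by the fastest-to-current performance ratio.

    core_perf_norm is the core's performance normalized so the fastest core is 1.0.
    """
    return util / core_perf_norm

# A thread at 40% utilization on a half-speed core would have needed ~80%
# of a full-speed core's capacity, so it compares as the "faster" thread.
assert homogenize(0.4, 0.5) == 0.8
```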

Normalization.
The next step is to normalize the homogenized CPU utilization of the running set of threads. By normalizing, we know exactly which thread is the most CPU-utilizing among the whole set of threads, because we have already homogenized them. The normalized utilization is N_j = H_j/H_max, where H_j is the j-th thread's homogenized CPU utilization and H_max is the maximum among the threads.
Using this, we map the relatively high-utilizing thread to the relatively high-performance core, and the relatively low-utilizing thread to the relatively low-performance core. Using such a relativity technique, we ensure a complete abstraction of the hardware details of a core and the program characteristics of a thread into single variables of CPU performance and utilization, respectively. In contrast to Performance Impact Estimation (PIE) proposed in [27], FTFC simplifies this estimation greatly. The relativistic nature of this mapping also ensures the avoidance of underutilization in a better way than, for example, Faster-Core-First Scheduling as seen in [28], which aggressively assigns threads to the fastest core. If the core to be migrated to is empty, the thread is simply migrated; if not, the migration is decided based on whether the to-be-migrated thread's utilization is closer to the core's CPU performance than that of the already resident thread.
Stage 3 of Figure 2 shows the normalizing process. Here, the thread utilization is normalized primarily so that it can be binary-searched through the already sorted list of normalized core performances.

Binary Searched Mapping.
After the homogenizing and normalizing phases, the mapping phase takes over. Here, we propose Binary Searched Mapping (BSM) to reduce time complexity and computation overhead. The mapping of threads is performed in a binary-search manner: the already sorted list of CPU performances is searched for the CPU utilization value of the subject thread. This ensures that the final entry at which the search terminates is the closest CPU performance to the utilization of the thread in question. This is done from lines 6 to 15 of Algorithm 1. This search runs in O(log m), where m is the number of unique core types, and, in the worst case, that is, if all the cores except the last one of a core type are occupied, additional iterations are needed to reach it. BSM thus ensures a complexity of O(n log m), where m is the number of distinct core types in the system and n is the number of threads to be mapped. Though the scope of this paper is limited to heterogeneous multicore architectures in mobile devices, for the purpose of validating our mathematically derived time complexity, we created dummy thread performance values and mapped them to a set of dummy cores, always with the same HM. We generated sets of cores ranging in number from 2 to 1000 and observed the trend of the postulated time complexity against the actual observed time. The results of this experiment can be seen in Figure 3. These findings also suggest the significance of the BSM algorithm in the expanded scope of heterogeneous multicore processors as a whole.
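The BSM step described above can be sketched as a binary search of each normalized thread utilization through the sorted list of normalized core performances. This is a minimal sketch using Python's bisect; the tie-breaking against an already resident thread, described earlier, is omitted, and the function name is an assumption.

```python
import bisect

def bsm_map(thread_utils_norm, core_perfs_sorted):
    """Map each normalized thread utilization to the closest core performance.

    core_perfs_sorted: ascending list of normalized per-core-type performances.
    Returns, per thread, the index of the chosen core type (O(n log m) overall).
    """
    mapping = []
    for u in thread_utils_norm:
        i = bisect.bisect_left(core_perfs_sorted, u)
        if i == 0:
            mapping.append(0)                      # below the slowest core
        elif i == len(core_perfs_sorted):
            mapping.append(i - 1)                  # above the fastest core
        else:
            # pick the neighbor whose performance is closest to the utilization
            left, right = core_perfs_sorted[i - 1], core_perfs_sorted[i]
            mapping.append(i - 1 if u - left <= right - u else i)
    return mapping
```

For instance, with core performances [0.25, 0.5, 1.0], the most utilizing thread (1.0) lands on the fastest core type while lightly utilizing threads land on the slowest.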
As can be seen in Table 1, the BSM algorithm with the FTFC scheduler outperforms the others. The WED algorithm requires offline profiling of applications based on their core-complementary characteristics, which is another encumbrance on the compiling process and a threat to its scalability. The naive approach to this mapping problem runs in O(n!); efficient mapping is guaranteed (mapping-wise), but it is too computationally expensive to be used, especially in resource-constrained mobile devices. MTS runs in O(n^2/m). The main drawback of the MTS heuristic approach is that it does not completely exhaust all possible swaps before it terminates. Its swaps are defined and prioritized by the ratio of the change in power to the change in throughput that the swap could produce. This leaves out combinations of swaps that could possibly be valuable, thus not always resulting in an optimized mapping.

Results
To conduct our experiments, we used the ESESC [29] simulator, which supports the heterogeneous configurations and the multitude of configurations we needed. The ESESC (enhanced SESC) simulator uses Time Based Sampling (TBS), which its authors deemed ideal when working with multithreaded executions. As seen in Figure 1, the workloads we used for the experiment consist of six PARSEC benchmarks [30], namely, swaptions, canneal, fluidanimate, lu_cb, blackscholes, and freqmine. All these benchmarks were used with their medium-sized inputs. We had to change these benchmarks into libraries prior to our simulation, since ESESC supports executing one program at a time that is able to initiate multiple threads which can run independently of each other on separate cores. We had two binaries, one for FTFC and the other for CFS. These schedulers take turns to run and spawn all six benchmarks from the PARSEC libraries they include, during which their power consumption and throughput are monitored and stored. One thread is allowed to execute for a maximum of 30 seconds and a minimum of 5 seconds. We took 30 samples in this manner and compared the difference. We define core heterogeneity as any inconsistency between complementary features of cores, such as microarchitectures, frequency/voltage, or any attributes that can cause performance differences. Among the important program characteristics are instruction-level parallelism (ILP), data locality, and branch predictability. The respective attributes for cores are issue width, data cache size, and branch predictor size. As such, we achieved core heterogeneity based on variations of these three attributes in addition to the ROB size. We assume that these four attributes produce a core performance discrepancy significant enough to create the arbitrary heterogeneity needed. All other attributes, including the frequency and voltage, were kept constant for each core type.
The ESESC simulator allows the modification of various attributes of the cores except the frequency or voltage. As such, there was no problem maintaining these, as it runs on their default static configurations.
We assume each core executes a single thread at most, and that the variations of the workloads rather than the types of workloads are more important when choosing the type of benchmarks. Our experiments contained two types of cores sharing a common L3 cache of size 3 MB and having dedicated L2 caches of the same or different sizes. There were three cores of each core type.
To create a configuration of the required heterogeneity, we first made a core generator script whose sole purpose was to create a set of cores with random variations of their prominent characteristics. We generated 2000 unique core types by randomly varying the issue width, L2 cache, branch predictor size, and ROB size. The issue width is a power of 4 in the range of 1 to 64. The L2 cache is a power of 2 ranging from 128 to 32768. The branch predictor and ROB sizes are powers of 2 in the range of 16 to 32768. This is unlike the method used in [31], where the instruction set of homogeneous processors is augmented via custom instructions.
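The core generator described above can be sketched as random draws over the stated attribute ranges (powers of 4 for issue width from 1 to 64, powers of 2 for the others). Function and field names are assumptions, not the paper's script.

```python
import random

def generate_core(rng):
    """One random core configuration over the attribute ranges from the text."""
    return {
        "issue_width": 4 ** rng.randint(0, 3),   # 1, 4, 16, 64
        "l2_cache": 2 ** rng.randint(7, 15),     # 128 .. 32768
        "branch_pred": 2 ** rng.randint(4, 15),  # 16 .. 32768
        "rob_size": 2 ** rng.randint(4, 15),     # 16 .. 32768
    }

rng = random.Random(0)
cores = [generate_core(rng) for _ in range(2000)]   # 2000 core types, as in the text
```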
The second phase was to assess the performance of generated cores. We did this by running four of the PARSEC benchmarks serially on each core type for a total of 60 seconds. We profiled each core type by calculating the total CPU utilization achieved by those running benchmarks. The benchmark selection is based on their proximity to the characteristics of workloads exhibited in mobile devices. The benchmarks we used for this purpose were swaptions, canneal, fluidanimate, and freqmine.
The third step was to select cores in such a way as to guarantee that their combination results in the heterogeneity desired. We took those 2000 utilization values from the second step and used them to generate sets of cores that satisfy any heterogeneity we desired, within the ceiling of heterogeneity possible given the set of variations in the recorded utilizations. We generated six configurations with an incremental heterogeneity of 0.02, starting from HM = 0. We ran six of the PARSEC benchmarks on these cores, used Linux's CFS and the proposed FTFC to schedule them, and recorded the total power and throughput. All six core configurations used can be seen in Table 2.
We compared the result of FTFC with that of CFS. We can see from the result in Figure 4 that, in the highest configuration, that is, HM = 0.1, FTFC saves power by 2.22% and improves performance by 52.62% compared to the CFS scheduler. Figure 5 shows the performance of FTFC as compared to CFS. The variations of workload utilization also play an important role in the FTFC scheduler's operation. We can see from Figures 4 and 5 that the power and throughput for FTFC are essentially the same as for CFS when the system is in its homogeneous configuration, that is, HM = 0. This is because FTFC is built on top of CFS, and zero heterogeneity reverts it to operating as CFS. The highest performance gain is obtained when the system is in its most heterogeneous configuration. FTFC generally maintains its advantage over CFS in both power and throughput across the heterogeneous configurations we investigated.

Conclusion
This paper presented a solution to thread mapping on heterogeneous configurations using thread-core performance complementarity based on thread CPU utilization and core capacity. Core performance was modeled by measuring the total CPU utilization of a fixed set of benchmarks running on a core. We achieved a power saving of 2.22% and a performance improvement of 52.62% compared to the CFS scheduler, with an improved time complexity of O(n log n). As future work, we propose to include frequency/voltage scaling to achieve higher heterogeneity measures.
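The complementarity idea described above can be sketched in a few lines: pair the heaviest threads with the most capable cores, with sorting dominating the cost at O(n log n). This is an illustrative sketch only; the function and dictionary names are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of complementarity-based thread-to-core mapping:
# heaviest threads go to the most capable cores. Names are illustrative.

def map_threads_to_cores(thread_utils, core_capacities):
    """Pair high-utilization threads with high-capacity cores.

    thread_utils:    {thread_id: observed CPU utilization}
    core_capacities: {core_id: profiled capacity score}
    Returns {thread_id: core_id}. Sorting dominates: O(n log n).
    """
    threads = sorted(thread_utils, key=thread_utils.get, reverse=True)
    cores = sorted(core_capacities, key=core_capacities.get, reverse=True)
    # Round-robin over the sorted cores so every thread gets a core even
    # when there are more threads than cores.
    return {t: cores[i % len(cores)] for i, t in enumerate(threads)}

mapping = map_threads_to_cores(
    {"t1": 0.9, "t2": 0.2, "t3": 0.6},
    {"big": 1.0, "little": 0.5},
)
print(mapping)  # t1 -> big, t3 -> little, t2 -> big
```

A real scheduler would refresh the utilization estimates periodically and remap only when the pairing changes, to avoid migration overhead.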

Competing Interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Introduction
Recently, simultaneous wireless information and power transfer (SWIPT) has been envisioned for wireless relay networks [1][2][3]. Research on SWIPT for relay networks has mainly focused on two relaying protocols, namely, time-switching (TS) relaying and power-splitting (PS) relaying [1]. In TS relaying, a relay harvests energy from a source-emitted RF signal and forwards the source signal in a time-division manner. In PS relaying, a relay extracts the energy needed for forwarding from its received source signal via a PS operation. Generally, PS relaying achieves higher spectrum efficiency than TS relaying by consuming fewer time slots. To utilize the available time slots effectively, SWIPT schemes with destination-aided energy flows (EFs) have emerged [2][3][4]. In [2], destination-aided EF was proposed for the TS relaying protocol. In [3], an autonomous multiple-input multiple-output (MIMO) relay employing a PS-based energy receiver was investigated with destination-aided EF. For cognitive relay systems, the authors in [4] proposed sending EFs from the destination to an energy-constrained relay.
In the above-mentioned destination-aided EF transmission schemes [2,3], only a single source-destination pair was considered. However, demands for relay-assisted SWIPT with multiple source-destination pairs have emerged. For instance, the authors in [5] proposed several power allocation schemes for a relay network where multiple source-destination pairs communicate with each other via a common energy harvesting relay. In [6], the outage performance of SWIPT in large-scale networks with/without relays was analyzed, where multiple receivers harvest energy using the PS technique. By employing multiple energy harvesting relays to assist multipair communications, the authors in [7] proposed distributed PS for SWIPT using game theory. In [8], multiple relays were applied to assist communications from multiple base stations to multiple cell-edge users, and PS-based energy receivers were deployed at the relays. For wireless power transfer enabled collaborative mobile clouds, cellular data distribution among multiple mobile users was investigated in [9]. It has been shown that energy-constrained relays need to harvest energy from ambient radio-frequency (RF) signals to support multipair communications in future dense heterogeneous networks [6][7][8][9][10].
One challenge in relay-assisted SWIPT is the limited amount of harvested energy. Due to path-loss and inefficient RF-to-DC conversion, only a small fraction of the energy emitted by a source can be harvested at the relay. In order to harvest energy efficiently, smart antennas have been employed in SWIPT systems (see [11] and references therein). In [12], TS and PS receivers were proposed for SWIPT in MIMO broadcasting channels. A low-complexity antenna switching scheme for MIMO relay networks with SWIPT was investigated in [13]. Moreover, multiple antennas were deployed at the relay to harvest energy from the EF emitted by a single destination [3]. Although MIMO techniques have improved SWIPT performance to a certain extent, harvesting enough energy cannot be guaranteed by a limited number of antennas.
Recently, massive MIMO techniques have been applied to improve wireless transmission capacity by exploiting their large array gain [14,15]. A review of massive MIMO can be found in [16]. By employing linear signal processing [17], massive MIMO techniques obtain increased signal-to-interference-plus-noise ratio (SINR) and power efficiency, which makes massive MIMO suitable for deployment in practical SWIPT prototypes [18]. In [19], massive MIMO antennas were employed at a hybrid data-and-energy access point such that the minimum data rate among user nodes can be maximized as the number of antennas at the access point grows without bound. In [20], the authors proposed a low-complexity antenna partitioning algorithm for SWIPT systems with massive MIMO. Subject to a minimum harvested energy constraint, interference can be mitigated and throughput can be maximized [20]. Note that all the above-mentioned works on SWIPT with massive MIMO were confined to single-hop communication systems. To the best of our knowledge, only a few works have addressed multipair massive MIMO relay networks with SWIPT; for example, the authors in [21] proposed TS and PS relaying protocols for multiway amplify-and-forward (AF) relay networks, where information sharing among multiple users was aided by a massive MIMO relay.
In this paper, we propose a two-phase relaying protocol with destination-aided EFs for a decode-and-forward (DF) relay network, where multiple source-destination pairs communicate via a massive MIMO relay [10]. Different from the work of [7], where synchronization among multiple relays is required, we consider only a single relay with massive MIMO antennas and destination-aided EFs. This model is of interest in many scenarios, for example, environmental sensor networks, the Internet of Things, and future dense heterogeneous networks, where interuser communications can be realized with the aid of an energy-constrained relay [3][4][5][6][7][8][9][10]. The main contributions of this paper can be summarized as follows: (i) The important performance metrics, including the asymptotic harvested energy and asymptotic achievable rate, are derived for the case in which the number of relay antennas grows without bound.
(ii) We show that destination-aided EFs are beneficial for boosting energy harvesting and, hence, that the achievable sum rate can be improved significantly. The optimal destination transmission power that maximizes the achievable sum rate is derived in closed form, as is the optimal PS factor.
(iii) We also show that, by using zero-forcing (ZF) processing and maximum-ratio combining/maximum-ratio transmission (MRC/MRT) at the relay, energy leakage interference (ELI) and multipair interference (MI) can be cancelled completely as the number of relay antennas grows without bound. When the number of relay antennas is large, the transmission powers of each source and each destination can be scaled inversely proportional to the number of relay antennas.
(iv) We show that the asymptotic sum rate is neither convex nor concave with respect to the PS factor and the destination transmission power, so that conventional convex optimization methods cannot be applied. We propose a one-dimensional embedded bisection (EB) algorithm to jointly determine the optimal PS factor and destination transmission power, so that the sum rate can be improved significantly.
The rest of this paper is organized as follows. Section 2 describes the system model and presents ZF processing and MRC/MRT processing. Section 3 presents the asymptotic analysis of system performance. The one-dimensional EB algorithm is also presented in Section 3. Section 4 presents numerical results and discusses the system performance of the proposed schemes. Finally, Section 5 summarizes the contributions of this study.
Notation. The superscripts (⋅)^T, (⋅)^*, and (⋅)^H represent the transpose, conjugate, and conjugate-transpose, respectively. E{⋅} and Var(⋅) stand for the expectation and variance operations, respectively. C^{M×N} denotes the M × N complex space, and CN(0, Σ) stands for the circularly symmetric complex Gaussian distribution with zero mean and covariance matrix Σ. Unless otherwise stated, vectors and matrices are represented by bold lowercase and uppercase letters, respectively. I_N is the N × N identity matrix and 0_{M×N} is the M × N zero matrix. Tr(Σ) denotes the trace of a matrix Σ.

System Model
A block diagram of the considered multipair massive MIMO DF relay network is depicted in Figure 1, where K single-antenna sources (S_1, S_2, ..., S_K) transmit their information signals to K single-antenna destinations (D_1, D_2, ..., D_K) via an M-antenna wireless-powered massive MIMO relay (R). Without loss of generality, we assume that M is larger than K in this paper. We also assume that a direct link between the source and destination of each pair does not exist due to large path-loss or obstacles. All nodes work in half-duplex mode. Moreover, the noises at the relay's information detecting (ID) receiver and at the destinations' receivers are modeled as complex zero-mean additive noises, while the noise power at the relay's energy harvesting (EH) receiver is assumed to be small enough to be neglected.
The relaying protocol consists of two time phases. In phase I, all source nodes transmit their information signals to R for forwarding. Meanwhile, all destination nodes transmit their EFs to R. Denoting the PS factor by θ ∈ (0, 1), the received signal power at R is split in a ratio of θ : 1 − θ for EH and ID processing, respectively. Denote the transmitted signals from the sources and destinations by x_S ≜ [x_S,1 x_S,2 ⋅⋅⋅ x_S,K]^T and x_D ≜ [x_D,1 x_D,2 ⋅⋅⋅ x_D,K]^T, respectively, where x_S,k is the information signal transmitted by S_k and x_D,k is the EF signal transmitted by D_k. We assume E{x_S x_S^H} = I_K and E{x_D x_D^H} = I_K. The received signal at the input of the ID receiver can be written as
y_R = √((1 − θ)P_S) G_SR x_S + √((1 − θ)P_D) G_DR x_D + ñ_R, (1)
where G_SR ∈ C^{M×K} and G_DR ∈ C^{M×K} are the channel matrices from the sources and the destinations to R, respectively, P_S and P_D are the average transmission powers of each source and each destination, respectively, and ñ_R is the effective additive noise vector at R satisfying E{ñ_R ñ_R^H} = σ²_R I_M. In practice, the effective noise can be defined as ñ_R ≜ √(1 − θ) n_R^a + n_R^b, where n_R^a and n_R^b are the additive noises before and after the passive PS, respectively. The channel matrices G_SR and G_DR are, respectively, modeled as
G_SR = H_SR D_SR^{1/2}, G_DR = H_DR D_DR^{1/2}, (2)
where H_SR ∼ CN(0_{M×K}, I_K ⊗ I_M) and H_DR ∼ CN(0_{M×K}, I_K ⊗ I_M) are the small-scale fading matrices modeling independent fast fading, and D_SR and D_DR are the K × K diagonal matrices modeling the distance-related path-loss. The kth diagonal elements of D_SR and D_DR are denoted by β_SR,k and β_DR,k, respectively. Moreover, β_SR,k and β_DR,k are assumed to be constant over many channel coherence intervals, as they change very slowly with time. Next, by exploiting the reciprocity of wireless channels, the channel matrix from R to the destinations in phase II can be defined as
G_RD = D_RD^{1/2} H_RD, (3)
where H_RD = H_DR^T and D_RD = D_DR. At R, the harvested energy at the EH receiver can be expressed as
E_h = η θ τ_u (P_S ‖G_SR‖²_F + P_D ‖G_DR‖²_F), (4)
where τ_u is the time duration of each phase and η is the RF-to-DC conversion efficiency.
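The power-splitting operation described above can be made concrete with a minimal numeric sketch: a fraction θ of the received power is routed to energy harvesting, the remaining 1 − θ to information decoding, and the harvested energy is scaled by the RF-to-DC efficiency η over the phase duration τ_u. All numeric values below are illustrative, not the paper's parameters.

```python
import math

# Minimal sketch of the PS operation: theta of the received power goes to
# energy harvesting (EH), 1 - theta to information decoding (ID); harvested
# energy is eta * theta * p_rx * tau_u. Values are illustrative.

def split_received_power(p_rx, theta, eta, tau_u):
    p_eh = theta * p_rx           # power routed to the EH receiver
    p_id = (1.0 - theta) * p_rx   # power routed to the ID receiver
    e_h = eta * p_eh * tau_u      # harvested energy over one phase
    return p_eh, p_id, e_h

p_eh, p_id, e_h = split_received_power(p_rx=1e-3, theta=0.4, eta=0.7, tau_u=98)
assert math.isclose(p_eh + p_id, 1e-3)  # the split conserves received power
print(e_h)
```

The split conserves the received power by construction; only the EH branch is discounted by the conversion efficiency.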
Thanks to the energy harvested in phase I, the relay transmission power can be expressed as
P_R = E_h / τ_u. (5)
In phase II, the relay forwards the decoded and processed signal, x_R ∈ C^{M×1}, to the destinations by using the energy harvested during phase I. The received signals at the destinations can be written in vector form as
y_D = G_RD x_R + n_D, (6)
where n_D is the additive noise vector at the destinations satisfying E{n_D n_D^H} = σ²_D I_K.

Channel Estimation.
We adopt the minimum mean-squared error (MMSE) method to obtain the channel state information (CSI) estimates. A part of each coherence interval is used for channel estimation. All the sources and destinations simultaneously send their pilot sequences of τ_p symbols to R. The received signal at R is given by
Y_p = √(τ_p p_p) (G_SR Φ_S + G_DR Φ_D) + N_p, (7)
where p_p is the transmission power of each pilot symbol and N_p is the additive noise matrix whose elements have zero mean and variance σ²_R,p. In addition, the kth rows of Φ_S ∈ C^{K×τ_p} and Φ_D ∈ C^{K×τ_p} are the pilot sequences transmitted from S_k and D_k, respectively. It is required that τ_p ≥ 2K, so that the pairwise orthogonality of the pilot sequences, that is, Φ_S Φ_S^H = I_K, Φ_D Φ_D^H = I_K, and Φ_S Φ_D^H = 0_K, can be guaranteed.
The MMSE channel estimates of G_SR and G_DR are, respectively, given by [17]
Ĝ_SR = (G_SR + (1/√(τ_p p_p)) N_p Φ_S^H) D̃_SR, Ĝ_DR = (G_DR + (1/√(τ_p p_p)) N_p Φ_D^H) D̃_DR, (8)
where D̃_SR ≜ ((σ²_R,p D_SR^{-1} / (τ_p p_p)) + I_K)^{-1} and D̃_DR ≜ ((σ²_R,p D_DR^{-1} / (τ_p p_p)) + I_K)^{-1}. Denote the estimation error matrices of G_SR and G_DR by E_SR and E_DR, respectively. Then, we have G_SR = Ĝ_SR + E_SR and G_DR = Ĝ_DR + E_DR. Since Ĝ_SR, E_SR, Ĝ_DR, and E_DR are independent, the estimates satisfy Ĝ_SR ∼ CN(0_{M×K}, D̂_SR ⊗ I_M) and Ĝ_DR ∼ CN(0_{M×K}, D̂_DR ⊗ I_M), where D̂_SR and D̂_DR are diagonal matrices whose kth elements are given by σ̂²_SR,k ≜ τ_p p_p β²_SR,k/(τ_p p_p β_SR,k + σ²_R,p) and σ̂²_DR,k ≜ τ_p p_p β²_DR,k/(τ_p p_p β_DR,k + σ²_R,p), respectively. Due to wireless channel reciprocity, Ĝ_RD = Ĝ_DR^T is used as the MMSE channel estimate of G_RD.
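The MMSE estimator above can be checked numerically in the scalar case. The sketch below assumes the standard per-link model y = √(τ_p p_p) g + n with g ~ N(0, β) and n ~ N(0, σ²) (a real-valued stand-in for the complex model), and verifies by Monte Carlo that the estimate's variance matches the closed-form σ̂² = τ_p p_p β² / (τ_p p_p β + σ²) given in the text. Parameter values are illustrative.

```python
import math
import random

# Scalar MMSE channel estimation sketch (real-valued stand-in for CN).
# Estimate: ghat = sqrt(tau_p*p_p)*beta / (tau_p*p_p*beta + sigma2) * y,
# whose variance is tau_p*p_p*beta^2 / (tau_p*p_p*beta + sigma2).

def mmse_variance(tau_p, p_p, beta, sigma2):
    return tau_p * p_p * beta ** 2 / (tau_p * p_p * beta + sigma2)

def mmse_estimate(y, tau_p, p_p, beta, sigma2):
    a = math.sqrt(tau_p * p_p) * beta / (tau_p * p_p * beta + sigma2)
    return a * y

random.seed(0)
tau_p, p_p, beta, sigma2 = 20, 1.0, 0.5, 0.1
n_trials, acc = 100_000, 0.0
for _ in range(n_trials):
    g = random.gauss(0.0, math.sqrt(beta))       # true channel
    n = random.gauss(0.0, math.sqrt(sigma2))     # estimation noise
    y = math.sqrt(tau_p * p_p) * g + n           # pilot observation
    acc += mmse_estimate(y, tau_p, p_p, beta, sigma2) ** 2

print(acc / n_trials, mmse_variance(tau_p, p_p, beta, sigma2))
# empirical variance of ghat should be close to the closed form
```

The agreement illustrates why longer pilots (larger τ_p p_p) push σ̂² toward β, that is, toward perfect CSI.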

Linear Processing at Relay.
The relay employs linear processing for DF relaying. In phase I, R uses a linear receiver matrix W to separate the received signal into K streams. With the linear receiver, the received signal becomes
r = W^H y_R. (10)
Then, the kth element of r is used for detecting the signal transmitted from S_k, which can be expressed as
r_k = √((1 − θ)P_S) w_k^H g_SR,k x_S,k + √((1 − θ)P_S) ∑_{j≠k} w_k^H g_SR,j x_S,j + √((1 − θ)P_D) w_k^H G_DR x_D + w_k^H ñ_R, (11)
where w_k and g_SR,k are the kth columns of W and G_SR, respectively. After detecting the received signal, the relay node applies linear precoding to the detected signal before broadcasting. With DF relaying, the relay first decodes the original source signal and then regenerates it [22]. Thus, the transmitted signal at R can be expressed as
x_R = √P_R A x̃_S, (12)
where A ∈ C^{M×K} is a linear precoding matrix and x̃_S is the regenerated signal vector. From (6) and (12), the received signal at D_k can be written as
y_D,k = √P_R g_DR,k^T a_k x̃_S,k + √P_R ∑_{j≠k} g_DR,k^T a_j x̃_S,j + n_D,k, (13)
where g_DR,k and a_k are the kth columns of G_DR and A, respectively, and n_D,k is the kth element of n_D.

ZF Processing.
Since both the sources and destinations transmit their signals to R in phase I and R broadcasts the processed streams to the destinations in phase II, MI affects both the received signal at R and the received signals at the destinations. For the received signal at R, the MI is the term √((1 − θ)P_S) ∑_{j≠k} w_k^H g_SR,j x_S,j, whereas, for the received signal at D_k, the MI is √P_R ∑_{j≠k} g_DR,k^T a_j x̃_S,j. Furthermore, the received signal at R also suffers from ELI resulting from the EFs transmitted by the destinations to R in phase I, which can be represented by √((1 − θ)P_D) w_k^H G_DR x_D. In such a case, the relay applies a ZF receiver and ZF precoding to process the signal. With ZF processing, both MI and ELI are removed by projecting each intended information stream onto the orthogonal complement of the MI and ELI subspaces. To this end, perfect CSI is required at R. However, when only an estimate of the CSI is available at R, residual MI and ELI still exist.
The ZF processing matrices at R are given by
W = W_ZF ≜ Ĝ_SR (Ĝ_SR^H Ĝ_SR)^{-1}, (14)
A = A_ZF ≜ α_ZF Ĝ_RD^H (Ĝ_RD Ĝ_RD^H)^{-1}, (15)
respectively. In (15), the normalization factor α_ZF is chosen to satisfy E{‖x_R‖²} = P_R. Therefore, we have [23]
α_ZF ≜ (Tr((Ĝ_RD Ĝ_RD^H)^{-1}))^{-1/2}. (16)

MRC/MRT Processing.
Since ZF processing neglects the noise effect, its detection performance is poor when the signal-to-noise ratio (SNR) is low. By contrast, MRC/MRT processing neglects MI in order to maximize the received SNR. As a result, MRC/MRT processing achieves better detection performance than ZF processing in the low-SNR region, at the cost of worse detection performance in the high-SNR region. Similarly to ZF processing, MRC/MRT processes the received signal and precodes the regenerated signal according to (10) and (12). The MRC receiver and MRT precoding matrices can be, respectively, expressed as [23]
W = W_MRC ≜ Ĝ_SR, (17)
A = A_MRT ≜ α_MRT Ĝ_RD^H, (18)
where the normalization factor for the MRT precoding is given by [23] α_MRT ≜ (Tr(Ĝ_RD Ĝ_RD^H))^{-1/2}.
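The contrast between the two linear receivers can be seen on a toy example. The sketch below uses a 2×2 real-valued channel estimate (a simplification of the paper's complex-valued M×K setting) and shows that the ZF combiner exactly nulls inter-stream interference (W^T Ĝ = I) while the matched-filter MRC combiner leaves nonzero off-diagonal terms. Pure Python, helper names are illustrative.

```python
# Toy comparison of ZF vs. MRC combining on a 2x2 real channel estimate.
# ZF: W = Ghat (Ghat^T Ghat)^(-1), so W^T Ghat = I (interference nulled).
# MRC: W = Ghat (matched filter), so W^T Ghat = Ghat^T Ghat (residual MI).

def T(A):  # transpose
    return [list(r) for r in zip(*A)]

def mm(A, B):  # matrix multiply
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inv2(M):  # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

Ghat = [[1.0, 0.5],
        [0.2, 1.0]]

W_mrc = Ghat                                # MRC: matched filter
W_zf = mm(Ghat, inv2(mm(T(Ghat), Ghat)))    # ZF: pseudo-inverse direction

print(mm(T(W_zf), Ghat))    # ~ identity: inter-stream interference nulled
print(mm(T(W_mrc), Ghat))   # nonzero off-diagonals: residual MI
```

The noise trade-off discussed above is visible here too: the ZF inverse amplifies noise when Ĝ is ill-conditioned, which is exactly the low-SNR regime where MRC wins.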

Asymptotic Analysis
In this section, the asymptotic harvested energy is derived for the case in which M → ∞ while K is fixed. The asymptotic symmetric sum rate over i.i.d. fading is derived for the case in which M → ∞, K → ∞, and K/M is constant.

Asymptotic Harvested Energy Analysis.
To begin with, the transmission powers at each source and each destination are scaled inversely proportional to the number of antennas at R, namely, P_S = E_S/M and P_D = E_D/M, while keeping E_S and E_D fixed.

Proposition 1.
Assume that the number of source-destination pairs K is fixed. Then, as M → ∞, the harvested energy at the relay converges almost surely to the limit given in (19), where a.s.→ denotes almost sure convergence.
Proof. See Appendix A.
To obtain further insights, by letting D_SR = β_SR I_K and D_DR = β_DR I_K, the harvested energy expression in (19) over i.i.d. fading can be asymptotically expressed as
E_h → η θ τ_u K (E_S β_SR + E_D β_DR). (20)
Remark 2. The asymptotic harvested energy in (20) is independent of the fast fading components of the wireless channels. Furthermore, the asymptotic harvested energy increases with the source and destination transmission powers, respectively. Therefore, for an expected value of the harvested energy at the relay, the source transmission power can be tuned down by tuning up the destination transmission power, and vice versa. Moreover, since the array gain increases monotonically with M, both the source and destination transmission powers can be scaled down inversely proportional to M without any asymptotic performance degradation.
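The power-scaling claim in Remark 2 can be sanity-checked numerically: with P_S = E_S/M and P_D = E_D/M, the array gain (the per-antenna channel gains summing to roughly M·β) cancels the 1/M power scaling, so the harvested energy converges to a value independent of M and of the fast fading. The sketch below uses a single-pair, real-valued toy model with illustrative values for η, θ, τ_u, and β.

```python
import math
import random

# Numeric check: harvested energy with 1/M-scaled powers converges as M
# grows, illustrating the array-gain/power-scaling cancellation.

def harvested_energy(M, E_S, E_D, beta_sr, beta_dr, eta, theta, tau_u, rng):
    # Sums of per-antenna channel gains concentrate around M * beta
    # (law of large numbers), cancelling the E/M power scaling.
    g_sr = sum(rng.gauss(0.0, math.sqrt(beta_sr)) ** 2 for _ in range(M))
    g_dr = sum(rng.gauss(0.0, math.sqrt(beta_dr)) ** 2 for _ in range(M))
    return eta * theta * tau_u * ((E_S / M) * g_sr + (E_D / M) * g_dr)

rng = random.Random(1)
# Deterministic limit: eta * theta * tau_u * (E_S * beta_sr + E_D * beta_dr)
limit = 0.7 * 0.4 * 98 * (2.0 * 0.5 + 1.0 * 0.5)
for M in (10, 100, 10000):
    print(M, harvested_energy(M, 2.0, 1.0, 0.5, 0.5, 0.7, 0.4, 98, rng), limit)
```

As M grows the simulated value concentrates around the deterministic limit, mirroring the almost-sure convergence stated in Proposition 1.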

Using a Large Receive Antenna Array.
To cancel the ELI, orthogonal projection can be applied in detecting the desired signal. However, projecting the ELI onto its orthogonal space may contaminate the desired signal. Note that if the subspace spanned by the ELI is orthogonal to the desired signal's subspace when M is large, the orthogonal projection scheme works well. Thanks to the massive array, the channel vectors of the desired signal and the ELI become nearly orthogonal as M grows large. Therefore, we adopt a linear receiver acting as an orthogonal projection of the ELI. Consequently, the ELI can be cancelled significantly by using the linear receiver with a large value of M. The main result is summarized in the following proposition.
Proof. See Appendix B.

Remark 4.
The result in Proposition 3 shows that the ELI and MI can be cancelled completely as M grows to infinity. The received signal at the relay after linear processing then includes only the desired signal. Therefore, the capacity of the S_k → R link grows without bound as M approaches infinity.

Using a Large Transmit Antenna Array
Proposition 5. Assume that the number of source-destination pairs K is fixed and that the transmission power at the relay is supplied by the harvested energy. Then the received signal at D_k for decoding the relayed information signal as M → ∞ is given by (23) for ZF processing and by (24) for MRC/MRT processing.
Proof. See Appendix C.
To obtain further insights, the expressions in (23) and (24) over i.i.d. fading can be simplified as in (25).
Remark 6. By using linear processing at R, the received signal at the destination D_k includes only the desired signal. Therefore, the capacity of the R → D_k link grows without bound, similarly to that of the S_k → R link. Furthermore, we can see that the noise and MI vanish as M grows to infinity. It can be clearly seen from (25) that the EFs emitted by the destinations provide a gain for decoding the information signal. Moreover, the received desired signals in (25) are independent of the small-scale fading of the wireless channels.

Sum Rate and Harvested Energy Trade-Off.
The end-to-end (e2e) achievable rate of the transmission link S_k → R → D_k is given by
R_k = min(R_SR,k, R_RD,k), (26)
where R_SR,k and R_RD,k are the achievable rates of the transmission links S_k → R and R → D_k, respectively. From (11), the received signal used for detecting x_S,k at the relay can be expressed as in (27), where the effective noise ñ_R,k is given by (28). It can be seen that the "desired signal" and the "effective noise" in (27) are uncorrelated. Based on the fact that the worst-case uncorrelated additive noise is independent Gaussian noise of the same variance (see [17] and references therein), the achievable rate can be obtained as in (29), where MI_k, ELI_k, and AN_k represent the MI, ELI, and additive noise effects, respectively, and are given by (30)-(32). Similarly, considering (13), the achievable rate of the R → D_k link can be expressed as in (33).
Although (33) implies that only statistical knowledge of the channel gains is applied at D_k in decoding the information signal, the performance degradation compared to a genie-aided receiver is negligible for a large value of M [17].
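The e2e rate structure above is simple enough to sketch: each hop rate is of the form κ log2(1 + SINR), with the pre-log factor κ = τ_u/(2τ_u + τ_p) (as defined in the text) accounting for pilot overhead and half-duplex relaying, and the e2e rate is the minimum of the two hop rates. The SINR values below are illustrative placeholders, not expressions from the paper.

```python
import math

# Sketch of the min-of-two-hops e2e achievable rate, with the pre-log
# factor kappa = tau_u / (2*tau_u + tau_p). SINRs are placeholders.

def hop_rate(sinr, tau_u, tau_p):
    kappa = tau_u / (2 * tau_u + tau_p)   # pilot + half-duplex overhead
    return kappa * math.log2(1 + sinr)

def e2e_rate(sinr_sr, sinr_rd, tau_u=98, tau_p=20):
    # DF relaying: the weaker hop limits the end-to-end rate.
    return min(hop_rate(sinr_sr, tau_u, tau_p),
               hop_rate(sinr_rd, tau_u, tau_p))

print(e2e_rate(100.0, 10.0))  # limited by the weaker R -> D hop
```

This min structure is why boosting the relay's harvested energy (and hence the R → D SINR) via destination-aided EFs can raise the e2e rate whenever the second hop is the bottleneck.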

Theorem 7.
For a finite number of relay antennas, the achievable rate of the system with ZF processing over i.i.d. fading is given by (34), where the pre-log factor κ ≜ τ_u/(2τ_u + τ_p) results from the pilot transmission and half-duplex relaying.
Proof. See Appendix D.
Remark 8. We can see from (34) that the achievable rates of the dual-hop links grow without bound as M → ∞.
Therefore, the e2e performance can always be improved by using a larger antenna array when ZF processing is employed. When M → ∞, K → ∞, and K/M is constant, the asymptotic sum rate over i.i.d. fading becomes the expression in (35), whose terms denote the asymptotic SINRs of the first and second hops of each source-destination pair, respectively. This expression shows that, although the SINR of the S_k → R link is decreased by the ELI, the SINR of the R → D_k link can be improved by employing destination-aided EFs. Therefore, the e2e performance can be improved whenever it is limited by the R → D_k link. Furthermore, we can reduce the transmission powers of each source and each destination proportionally to 1/M while maintaining a given quality of service.

Proposition 9.
With ZF processing, the asymptotic sum rate is neither convex nor concave with respect to the destination transmission power, and the optimal destination transmission power that achieves the allowable maximum asymptotic sum rate is given by (38).
Proof. See Appendix E.

Proposition 10.
With ZF processing, the asymptotic sum rate is concave with respect to θ, and the optimal θ that achieves the allowable maximum asymptotic sum rate is given by (39).
Proof. See Appendix F.
Theorem 11. With ZF processing, the asymptotic sum rate is neither convex nor concave with respect to (θ, P_D), and there is a unique optimal (θ*, P_D*) that achieves the allowable maximum asymptotic sum rate.
Proof. See Appendix G.
Since the asymptotic sum rate is neither convex nor concave with respect to (θ, P_D), the optimal (θ*, P_D*) cannot be obtained by applying conventional convex optimization methods [24]. However, Theorem 11 shows that the optimal (θ*, P_D*) is unique, Proposition 10 shows that the asymptotic sum rate is concave with respect to θ, and Proposition 9 shows that, for any given θ, the corresponding optimal P_D is uniquely determined. Thus, the two-dimensional search for the optimal (θ*, P_D*) can proceed as a one-dimensional concave optimization over θ, with the corresponding P_D* explicitly determined by (38). Since bisection is effective in searching for the optimum of a concave function [24], this paper proposes a one-dimensional EB algorithm to find the optimal (θ*, P_D*), with the computation of the optimal P_D* embedded in the one-dimensional bisection. The proposed EB algorithm is presented in Algorithm 1.
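The embedded-bisection idea above can be sketched generically: bisect on the sign of the derivative of the concave one-dimensional function f(θ) = R(θ, P_D*(θ)), where the inner optimal P_D*(θ) is available in closed form. The objective and inner solution below are illustrative stand-ins with the same structural properties (concave in θ, closed-form inner optimum), not the paper's actual expressions from (35) and (38).

```python
import math

# Hypothetical sketch of the one-dimensional embedded bisection (EB):
# outer bisection on theta, inner closed-form P_D*(theta) embedded in
# the objective. Surrogate functions only; not the paper's formulas.

def eb_search(f, lo=1e-3, hi=1.0 - 1e-3, tol=1e-6):
    """Bisection on the sign of f'(theta) for a concave f on (lo, hi)."""
    h = 1e-6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        slope = (f(mid + h) - f(mid - h)) / (2 * h)  # central difference
        if slope > 0:
            lo = mid   # maximizer lies to the right of mid
        else:
            hi = mid   # maximizer lies to the left of mid
    return 0.5 * (lo + hi)

def pd_star(theta):
    # Placeholder for the closed-form inner solution P_D*(theta).
    return 1.0 / (1.0 + theta)

def sum_rate(theta):
    # Placeholder concave surrogate: the EH side benefits from a larger
    # theta while the ID side suffers, mirroring the trade-off in the text.
    return (math.log2(1 + theta * pd_star(theta))
            + math.log2(1 + 2.0 * (1 - theta)))

theta_opt = eb_search(sum_rate)
print(round(theta_opt, 3))  # interior optimum of this surrogate, ~0.118
```

Because each iteration halves the search interval, the outer loop needs only about log2(1/tol) objective evaluations, each of which involves one closed-form inner solve, which is what makes the embedded formulation cheap compared to a full two-dimensional grid search.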
To quantify the trade-off between the harvested energy and achievable sum rate, we solve for θ in (35) and substitute it into (20). The trade-off between the asymptotic harvested energy and sum rate with ZF processing can be derived as in (40), where the constants are the same as those of Proposition 10.
Theorem 12. For a finite number of relay antennas, the achievable rate of the system with MRC/MRT processing over i.i.d. fading is given by (42).
Proof. See Appendix H.
Remark 13. We can see from (42) that, as M → ∞, the achievable rates of the dual-hop links grow without bound.
Thus, the e2e performance can always be improved by using a larger antenna array when MRC/MRT processing is employed. When M → ∞, K → ∞, and K/M is constant, the asymptotic symmetric sum rate over i.i.d. fading is given by (43).

Proposition 14. With MRC/MRT processing, the asymptotic sum rate is neither convex nor concave with respect to the destination transmission power, and the optimal destination transmission power that achieves the allowable maximum asymptotic sum rate is given by (44).
Proof. By following a procedure similar to the proof of Proposition 9 in Appendix E, Proposition 14 can be reached.

Proposition 15.
With MRC/MRT processing, the asymptotic sum rate is concave with respect to θ, and the optimal θ that achieves the allowable maximum asymptotic sum rate is given by (45).
Proof. By following a procedure similar to the proof of Proposition 10 in Appendix F, Proposition 15 can be reached.

Theorem 16.
With MRC/MRT processing, the asymptotic sum rate is neither convex nor concave with respect to (θ, P_D), and there is a unique optimal (θ*, P_D*) that achieves the allowable maximum asymptotic sum rate.
Proof. By following a procedure similar to the proof of Theorem 11 in Appendix G, Theorem 16 can be reached.
Since Propositions 14 and 15 and Theorem 16 have expressions similar to those of Propositions 9 and 10 and Theorem 11, respectively, the one-dimensional EB algorithm can be applied to the case of MRC/MRT processing straightforwardly. Similarly to (40), the asymptotic harvested energy versus sum rate trade-off with MRC/MRT processing can be derived as in (46), where the constants are the same as those of Proposition 15.

Numerical Results
This section presents numerical results to verify the performance of the proposed protocol. For simplicity of illustration, the circuit power consumption and EH receiver sensitivity at the relay are ignored [21]. In all the illustrative examples, we choose K = 10, τ_p = 2K, τ_u = 98, η = 0.7, p_p = 30 dBm, and σ²_R,p = σ²_R = σ²_D = σ² = −90 dBm. Furthermore, we set D_SR = D_DR = β I_K to model i.i.d. fading, where β = −65 dB corresponds to the path-loss at a distance of about 10 m for a carrier frequency of 915 MHz.
In Figure 2, the trade-off between the harvested energy and achievable sum rate is investigated for ZF processing, with M gradually increasing from 40 to 10000. In Figure 2, the points of the trade-off curves corresponding to a higher harvested energy refer to larger PS ratios and, consequently, this scenario leads to a higher sum rate. Nevertheless, as θ → 1, the achievable sum rates for all the fixed P_D values drop dramatically and approach zero. It can be seen from Figure 2 that the trade-off between the harvested energy and sum rate can be improved significantly by using a very large antenna array. For example, for P_D equal to 20 dBm and 30 dBm, destination-aided EFs effectively improve the trade-off between harvested energy and sum rate. However, for P_D = 35 dBm, destination-aided EFs deteriorate the trade-off. Furthermore, since a fixed P_D achieves only a part of the sum rate attained by the optimal destination power for ZF processing, Figure 2 shows that the curves partially overlap, whereas the optimal destination power achieves the highest trade-off between harvested energy and sum rate.
In Figure 3, we investigate the trade-off between the harvested energy and achievable sum rate for MRC/MRT processing. Four levels of destination-aided EFs, that is, P_D = 30 dBm, P_D = 20 dBm, P_D = 0 mW, and the optimal destination power for MRC/MRT processing, are considered, whereas P_S is fixed at 25 dBm. The asymptotic trade-off curves are plotted by using (46). The trade-off curves for M → ∞ are also presented in Figure 3 and serve as trade-off bounds. When M is finite, similarly to the case of ZF processing, a larger PS ratio leads to a higher harvested energy as well as a higher achievable sum rate. Also, the achievable sum rate drops dramatically as θ → 1.
In Figures 4 and 5, we investigate the sum rate versus P_D for ZF processing and MRC/MRT processing, respectively. In Figures 6 and 7, we investigate the sum rate versus the PS factor for ZF processing and MRC/MRT processing, respectively. The results in Figures 6 and 7 verify the correctness of the analytical optimal PS factors for ZF and MRC/MRT processing, respectively. As can be seen from Figures 6 and 7, the optimal PS factor remains almost unchanged as M varies for the same power budget. Furthermore, the optimal PS factor for ZF processing is almost the same as that for MRC/MRT processing. Moreover, Figures 6 and 7 verify that the sum rate is concave with respect to θ, and that the one-dimensional EB algorithm achieves the allowable maximum sum rate. In Figures 8 and 9, average sum rates are plotted against the source transmission power for ZF processing and MRC/MRT processing, respectively. The analytical curves are plotted by using (34) and (42) for ZF processing and MRC/MRT processing, respectively, whereas the exact sum rate curves are plotted by using Monte Carlo simulations. Figures 8 and 9 clearly show that the closed-form achievable rates in (34) and (42) are accurate in computing the sum rate based on channel-statistics-aided decoding with a finite number of relay antennas. Furthermore, it can be clearly seen from Figures 8 and 9 that destination-aided EFs can improve the sum rate significantly, which verifies that the ELI can be effectively cancelled out by massive MIMO processing. Moreover, the highest sum rate is achieved by the proposed one-dimensional EB algorithm. Note that the closed-form optimal destination transmission powers achieve the second highest sum rate in Figures 8 and 9, respectively, which verifies the correctness of Propositions 9 and 14. Also, the piecewise sum rate curves verify the correctness of Propositions 10 and 15.
By comparing the curves of M = 40 and M = 10000, we observe that sum rate gains can be obtained by increasing the number of relay antennas while keeping the source transmission power in useful regions. Notably, Figures 8 and 9 show that the sum rate achieved by ZF processing is higher than that of MRC/MRT processing in the high-P_S region, whereas both ZF processing and MRC/MRT processing achieve almost the same sum rate in the low-P_S region.
The sum rate versus the number of relay antennas is investigated in Figure 10 with P_S = 25 dBm. As M increases, the corresponding sum rate increases accordingly. The curves in Figure 10 show that ZF processing achieves a higher sum rate than MRC/MRT processing for the considered power budget. Furthermore, over the whole range of M, the highest sum rate is achieved by the one-dimensional EB algorithm, and the second highest sum rate is achieved by the closed-form optimal destination transmission powers.

Conclusion
A destination-aided SWIPT relaying protocol has been proposed for a multipair massive MIMO relay network, in which a PS relay employs linear processing to cancel MI and ELI. Closed-form expressions for the asymptotic harvested energy and symmetric sum rate with a massive MIMO relay have been derived. The trade-off between asymptotic harvested energy and achievable sum rate has been quantified. The effect of destination-aided EFs on the sum rate has been investigated, and our results reveal that destination-aided EFs can boost energy harvesting, so that the achievable sum rate can be significantly improved. Meanwhile, the detrimental impact of ELI on the sum rate can be cancelled out by using a massive MIMO relay with linear processing. It has been shown that the asymptotic sum rate is neither convex nor concave with respect to the PS factor and destination transmission power, and a one-dimensional EB algorithm has been proposed to obtain the optimal PS factor and destination transmission power. The significant sum rate improvement of the proposed scheme has been verified by numerical results.
where h_SR,k and h_DR,k are the kth columns of H_SR and H_DR, respectively. Consequently, as M → ∞, it can be shown that (A.2) holds. By substituting P_S = E_S/M and P_D = E_D/M into (4), the harvested energy at the EH receiver of R can be rewritten as in (A.3). Based on (A.2), whenever the number of antennas at the relay grows without bound, the harvested energy expression in (A.3) can be derived as in (A.4).

B. Proof of Proposition 3
(1) For ZF Processing. By substituting (9) and (14) into (10), the processed signal at the relay before amplification can be expressed as in (B.1). By using the law of large numbers, we have (B.2). Therefore, as M → ∞, the received and processed signal at the relay for detection satisfies (B.3), which can be rewritten in element form as (B.4). Next, we consider the ELI term. Note that the vector ĝ_SR,k is independent of the vector ĝ_DR,j. With ZF processing, it can be shown by using the law of large numbers that (B.5) holds. Similarly, it can be shown that (B.6) holds. By substituting (B.4), the element form of (B.5), and the element form of (B.6) into (11), we arrive at (21).
(2) For MRC/MRT Processing. By using the law of large numbers, the desired signal in (11) converges to a deterministic value as the number of receive antennas goes to infinity; that is, as → ∞, the cross-term → 0 due to the fact that ĝ SR, and g DR, are statistically independent. Thus, the ELI converges to 0 as → ∞. Since g SR, and n R are also statistically independent, as → ∞, it can be shown that (C.4) holds. Moreover, as → ∞, it can be shown that a.s. → 0 (C.5). By substituting (C.4) and (C.5) into (13), we obtain (23). Likewise, by substituting (C.8) and (C.9) into (13), we obtain (24).
(i) Compute E{w g , }. Substituting (14) and (9) into W G SR , we have which shows that where E SR, is the th column of E SR . Since the zero-mean valued E SR, is independent of w , we have E{w E SR, } = 0. Thus, . From (D.2) and (D.3), the variance of w g SR, can be expressed as (iii) Compute . Since w g SR, = w E SR, , for ̸ = , we have (D.6). (iv) Compute . From (31), the ELI associated with the ZF receiver can be expressed as . Similarly, we obtain (D.9). Over i.i.d. fading, the achievable rate of the transmission link S → R can be expressed as SR, = log 2 (1 + ) (D.10). (2) Derive RD, . From (33), E{g DR, a }, Var(g DR, a ), and E{|g DR, a | 2 } are needed in computing RD, . By applying procedures similar to those in deriving SR, , we obtain , for ̸ = .

E. Proof of Proposition 9
The SINRs SR and RD in (35) can be rewritten as follows. From a practical point of view, we assume that the CSI estimation error is small enough that DR > 2 DR and 1 > 0 hold. Since ZF is a linear transformation of D , we first search for the optimal ZF instead of the optimal D . Note that SR and RD are decreasing and increasing functions with respect to D ∈ [0, ∞), respectively; we therefore consider two cases in determining the optimal solution.
For the case of SR | D =0 ≤ RD | D =0 , we have S ≥ ZF / SR by reducing SR | D =0 ≤ RD | D =0 . In such a case, we have ≜ min( SR , RD ) = SR based on the monotonicities of SR and RD . Since SR is decreasing with respect to ZF (and D ), the optimal ZF that achieves the allowable maximum asymptotic sum rate as well as is obtained by the allowable minimum ZF , that is, by setting D = 0.
Since the logarithm term log 2 (1 + ) inherits the convexity/concavity of , determining the convexity/concavity of (35) with respect to D is equivalent to determining the convexity/concavity of = min( SR , RD ) with respect to D . For the case of SR | D =0 > RD | D =0 , it can be shown that RD is concave with respect to D ∈ (0, ZF D ] and SR is convex with respect to D ∈ ( ZF D , ∞). Thus, = min( SR , RD ) as well as the asymptotic sum rate is neither convex nor concave for such a case.

F. Proof of Proposition 10
The SINRs SR and RD in (35) can be rewritten as follows. From a practical point of view, we assume that the CSI estimation error is small enough that SR > 2 SR , DR > 2 DR , 2 > 0, and 2 > 0 hold. Since the allowable maximum asymptotic sum rate is achieved with the allowable maximum = min( SR , RD ), we continue the proof by searching for the value that maximizes . It can be shown that SR and RD are decreasing and increasing functions with respect to , respectively. Furthermore, we have SR | =0 > RD | =0 and SR | =1 < RD | =1 . Thus, there is a single cross-point between SR and RD for ∈ (0, 1). Denote the cross-point by ZF ; it can be shown that = RD for ∈ (0, ZF ) and = SR for ∈ ( ZF , 1). Since SR and RD are concave with respect to ∈ (0, 1), it can be concluded that is concave with respect to . Furthermore, noting the monotonicities of SR and RD , the allowable maximum is achieved at the root of SR − RD = 0, which can be expressed as (39). This proves Proposition 10.
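The cross-point argument above, a decreasing SR curve meeting an increasing RD curve at a unique root, can be illustrated with a generic one-dimensional bisection search; the SINR expressions below are placeholders chosen for illustration, not the paper's formulas.

```python
# Generic bisection: find the root of f on [lo, hi] when f changes sign once.
def bisect_root(f, lo, hi, tol=1e-9):
    """Root of a continuous f with opposite signs at lo and hi."""
    flo = f(lo)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) == 0 or hi - lo < tol:
            return mid
        if (f(mid) > 0) == (flo > 0):   # root lies in the upper half
            lo, flo = mid, f(mid)
        else:                            # root lies in the lower half
            hi = mid
    return 0.5 * (lo + hi)

# Placeholder SINRs: decreasing in rho for S->R, increasing for R->D,
# mimicking the monotonicities used in the proof.
sinr_sr = lambda rho: 10.0 * (1.0 - rho)
sinr_rd = lambda rho: 6.0 * rho
rho_star = bisect_root(lambda r: sinr_sr(r) - sinr_rd(r), 0.0, 1.0)
print(round(rho_star, 6))  # 10(1 - r) = 6r  ->  r = 0.625
```

The same search structure underlies the one-dimensional algorithm referenced in the conclusion, with the placeholder functions replaced by the actual SINR expressions.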

G. Proof of Theorem 11
We first prove that the asymptotic sum rate is neither convex nor concave with respect to ( , D ). Since the logarithm does not change convexity/concavity, determining the convexity/concavity of (35) with respect to ( , D ) is equivalent to determining the convexity/concavity of ≜ min( SR , RD ) with respect to ( , D ). It can be shown that = SR occurs with nonzero probability, so we investigate the convexity/concavity of SR without loss of generality. The expression of SR in (36) can be rewritten as (G.1). From a practical point of view, we assume that the CSI estimation error is small enough that SR > 2 SR and > 0 hold. Then, the Hessian of SR with respect to ( , D ) can be expressed as (G.2). Since ∂ 2 SR /∂ 2 = −2 D /( + (1 − ) D ) 3 < 0 and |̃| < 0, the Hessian is neither positive definite nor negative definite. Therefore, SR is neither convex nor concave with respect to ( , D ).
For the proof of the uniqueness of the optimal ( , D ), we proceed by contradiction. Denote ( * , * D ) as an optimal point that achieves the maximum asymptotic sum rate. From Proposition 9, we know that * D = ZF D ( * ) is uniquely determined. From Proposition 10, we know that * = Z ( * D ) is uniquely determined. Suppose that there is another optimal point (̃ * , ̃ * D ) that achieves the maximum sum rate with ̃ * ̸ = * and ̃ * D ̸ = * D . From Proposition 9, we have sum ( * , ZF D (̃ * )) < sum ( * , ZF D ( * )).

Introduction
The rapid development of information technology, for example, 3G/4G mobile communication and the Internet of Things (IoT), has largely changed our everyday lives. LoWPAN devices are increasingly developed for embedded applications in biomedical electronics, wireless sensor networks, and environment monitoring [1][2][3][4][5][6]. LoWPAN devices are indispensable for modern electronic applications, and numerous hardware/software techniques have been developed to drastically reduce functional power dissipation [7][8][9].
Engineers developing battery-powered devices face strong pressure to reduce power consumption, and many methods for reducing the power consumption of a system have been proposed [10][11][12]. They are therefore exploring component-level dynamic power consumption, which is difficult to measure, especially for the low-power devices used in IoT products, because of the limited dynamic range, large measurement noise, and limited bandwidth of typical instruments. For example, most battery-powered devices have a low-power sleep mode that draws very little supply current, such as less than 1 µA, while the active mode usually requires more than 10 mA. It is difficult to cover such a wide dynamic range of currents with a single measurement. Moreover, the testing of these low-cost, low-power devices is a daunting task; since such devices are cost-sensitive, test cost is a major consideration [13,14].
For the development of low-power devices and applications, power (or current) measurement is important. Thus, in order to fulfill the requirements for power consumption reduction, precise dynamic current waveform measurement and debugging are required. In this paper, a low-cost, real-time current measurement system is proposed that is capable of digitizing a wide range of currents, classifying different abnormal statuses, and locating the specific abnormal module, thereby significantly shortening the development process. The system consists of a power measurement platform called PTone and a novel abnormal status diagnosis mechanism built on top of PTone.
PTone can be used to detect the real-time power of LoWPAN devices and to determine the state of each module of the DUT system, which is meaningful to R&D on LoWPAN devices and applications. By comparing the power consumption of a well-performing device and the DUT in the corresponding state of the device system, abnormal status can be estimated and detected. Our hardware implementation approach is described in detail in Section 3.
In traditional abnormal status detection, the usual practice is simply to judge whether a status is abnormal or not. No approaches have been developed for accurate abnormal-status localization and classification for IoT devices. In order to classify different abnormal statuses, a novel abnormal status diagnosis mechanism based on sliding-window feature extraction (SWFE) is proposed on top of PTone.
Although much research has been done on designing systems that detect abnormal status by performing feature selection, to the best of our knowledge, no specific approaches have been developed for IoT devices. The work in [15,16] developed a fine-grained approach to feature selection, but at the cost of enormously increased computational complexity. Wang et al. [17] developed a system based on Bloom filters which avoids some of this computational complexity; despite the improvement, an effective Bloom filter implementation may not be possible on resource-constrained embedded devices [18]. Additional approaches to improving computational complexity have been explored: McPAD [19] measures features using a sliding window to analyze pairs of bytes, and [20,21] also use a sliding-window approach to extract features from the payload. Although these approaches obtain reliable detection results, they are not able to reduce computational complexity.
First, we design a work state detection algorithm (WSDA), which approximates the start and end points of active time in the work state by continuously moving a sliding window over the observed waveform. Using these start and end points, we can extract the waveform series in the work state for further analysis instead of using the whole waveform. PTone chooses features based on the unique characteristics of each active period in the work state of LoWPAN devices, which allows us to extract the features of the power waveform more accurately. Then a Support Vector Machine (SVM) is used to compare the profiles of the measured waveform series and classify different abnormal statuses. Finally, we investigate whether PTone can accurately classify DUT system status based on the extracted waveform features and verify the classification accuracy through experiments. The system considers aspects of testing low-power devices that previous measurement systems cannot address, because they cannot detect broadband, low-level current waveforms.
The remainder of this paper is organized as follows. Section 2 provides an overview of the system. Section 3 describes our hardware implementation approach for PTone in detail. Section 4 describes our abnormal status classification algorithm in detail and evaluates the performance of our approach. Section 5 concludes the paper.

Overview.
Nowadays, there is strong demand for a real-time power measurement platform for LoWPAN devices, with applications to sensing in environments where it is difficult or impossible to change batteries and where the exact position of the devices might not be known. We propose a power measurement platform called PTone which can detect the real-time power of LoWPAN devices and determine the state of each module of the DUT system, which is meaningful to R&D on LoWPAN devices and applications. By comparing the power consumption of a well-performing device and the DUT in the corresponding state of the device system, the abnormal module can be estimated and detected.
PTone consists of a low-cost power detection hardware platform and a service (such as a laptop), as shown in Figure 1.
In order to reduce power consumption, LoWPAN devices' wireless communication is typically divided into a work state and a sleep state. In the work state, LoWPAN devices have several periods of active time: receiving a beacon signal from a source device, replying with an Acknowledgement (ACK) to the beacon signal, sending a data packet, and receiving an ACK for the transmitted data packet. In the sleep state, LoWPAN devices enter a low-power status to extend battery life.
When a LoWPAN device receives a wireless connection request from a source device, it enters the work state to send an ACK and then keeps quiet for a short time while the source device prepares. Work states and sleep states constitute a period of wireless communication, as shown in Figure 2. We can evaluate the status of a LoWPAN device system from the way power consumption shapes the power waveforms, because different components in the devices have unique power consumption. However, there are three key technical challenges. The first is to segment the power waveforms to identify the start time and end time of the active periods in the work state. We studied the characteristics of typical power waveforms of different kinds of LoWPAN devices. We observed that the power waveforms are periodic and that power consumption increases markedly during each active time in the work state, causing power pulses. We also observed that the period of the DUT system is stable and that the intervals and amplitudes of the power pulses are regular. Based on these observations, we design a power waveform feature extraction algorithm which approximates the start and end points of active time in the work state by continuously moving a sliding window over the observed waveforms.
The second technical challenge is to extract features of the power waveforms for building a classification model for each status of the DUT system. Because the power consumption of different components is similar, it is hard to separate out the noise in the waveforms, which is introduced by various factors such as circuit noise of the DUT and of the measuring platform. Conventional features such as maximum peak power, RMS deviation, the intervals between power pulses, entropy, and the number of power pulses cannot be used directly for classification. To address this challenge, we apply a Hampel filter to the power consumption waveforms to extract the valuable signals that contain the power changes caused by system behavior, and to detect and remove the outliers caused by digital circuit noise. The Hampel filter computes the median of each sample and its surrounding samples, and estimates the standard deviation of each sample about its window median using the median absolute deviation, which can be used to remove abnormal points in the waveform series.
The third technical challenge is to compare the profiles of the filtered waveforms, which is hard to do in real time on Commercial-Off-The-Shelf (COTS) power detection platforms, which usually provide only real-time display or recording of power consumption; a low-complexity algorithm for feature comparison is therefore needed. The mean power values of different work states are hard to align with each other, because the estimated start and end points differ from the true ones, which leads to much higher errors in the mean value. Moreover, the interval time between work states fluctuates over time, and different interval times may occur in the same system status. Therefore, the mean power and the interval time of the work state are not enough to identify the system status, and we need to combine them with more waveform features to classify system statuses.
To address this challenge, PTone uses the extracted features to train a Support Vector Machine (SVM) for the classification of multiple system statuses and uses the maximum a posteriori (MAP) probability obtained from the SVM to estimate the system status. MAP estimation yields a point estimate of an unobserved quantity based on measured data.
The key novelty of this paper is proposing the first low-cost power measurement platform and an effective status detection method for LoWPAN DUT systems. We have shown that fine-grained status detection is possible by properly analyzing and processing the measured power waveforms, and we have conducted experiments to prove that the proposed techniques work on the proposed low-cost platform. Examples include the detection of two common faults of the microcontroller unit and two frequently encountered problems in the radio frequency module across different types of LoWPAN DUTs. Our PTone techniques can potentially detect a wider range of system statuses by building a more complete component power consumption model in the future.

Hardware Implementation
3.1. Hardware Overview. The block diagram of the proposed current measurement platform is shown in Figure 3. As seen from the figure, it mainly consists of five parts. (1) Analog signal detection circuit: the current of LoWPAN devices is usually very small, which makes it hard to measure directly. Therefore, a controllable gain amplifier is required to amplify the low voltage at the output of the signal detection sensor, represented by PGA in Figure 3. The analog signal amplified by the PGA is then sampled and converted to a digital signal by the analog-to-digital converter circuit, represented by ADC in the figure.
In order to achieve a wide dynamic gain range, wide bandwidth, low noise, and high linearity, and to digitize a wide range of signals, the reference voltage, the input gain of the PGA, and the output data rate of the ADC must be set to reasonable values by the system control module, represented by MCU in the figure. The functional characteristics of the proposed system are shown in Table 1.

Programmable-Gain Instrumentation Amplifier (PGA).
In order to obtain an output voltage that can be digitized directly, the PGA281 is used as the instrumentation amplifier with digital gain control in Figure 3. The PGA281 offers excellent DC precision and long-term stability, with internal filters that minimize noise; its output stage connects to the low-voltage supply. The PGA is implemented with an improved parallel-feedback Flipped Voltage Follower (FVF) to achieve a wide dynamic gain range, wide bandwidth, low noise, and high linearity.
The input voltage gain of the PGA281 ranges from 0.125 to 128 in binary steps. The output stage has a gain adjustment factor, 1 or 1.37, to adjust the gain to the optimal value. The desired gain is selected by controlling the digital inputs G4:G0, which set the gain of the PGA, G_PGA.
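The binary-step gain selection described above can be sketched as follows. The register encoding (gain = 0.125 × 2^code for a 5-bit code) and the 1.375 output factor are illustrative assumptions, not the exact PGA281 register map.

```python
# Hypothetical helper: map a desired gain to the nearest binary-step gain
# (0.125 ... 128), optionally applying the output-stage adjustment factor.
def select_pga_gain(desired_gain, use_output_factor=False):
    """Return (gain_code, effective_gain) closest to desired_gain."""
    out_factor = 1.375 if use_output_factor else 1.0  # text quotes ~1.37
    best_code, best_gain = 0, 0.125 * out_factor
    for code in range(11):                 # 0.125 * 2**10 = 128
        g = 0.125 * (2 ** code) * out_factor
        if abs(g - desired_gain) < abs(best_gain - desired_gain):
            best_code, best_gain = code, g
    return best_code, best_gain

code, g = select_pga_gain(32)
print(code, g)   # code 8 -> gain 32.0
```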
In the hardware interface circuit of the proposed system, we use a sampling resistor with a small resistance R_s that is sensitive to changes of current. The voltage across the resistor is denoted V_s, and the current through the resistor is given by I = V_s/R_s. The input voltage of the ADC equals the output voltage of the PGA, V_ADCin = G_PGA · V_s, where G_PGA stands for the gain of the PGA.

Analog-to-Digital Converter (ADC).
The analog-to-digital converter circuit uses the ADS1259 to digitize the signals at the PGA281 output. The ADS1259 outputs 24 bits of conversion data in binary format. Combined with the signal amplifier, this forms a high-resolution, high-accuracy measurement system capable of digitizing a wide range of signals.
Depending on the input value, the ADS1259 produces an output code D_ADC, a 24-bit number (written in hexadecimal). If the input voltage V_ADCin is larger than the positive full-scale input V_REF, the output code is set to 7FFFFFh. V_ADCin is provided by the PGA and equals the PGA output voltage; namely, V_PGAout = V_ADCin. If the input voltage is less than −V_REF(2^23/(2^23 − 1)), the output code is set to 800000h. If the input voltage lies in the interval (−V_REF(2^23/(2^23 − 1)), V_REF), the input signal is digitized to a dynamic output code, which can be expressed as D_ADC = (V_ADCin/V_REF)(2^23 − 1). According to the Nyquist theorem, the input analog signal can be recovered without distortion; thus the input voltage V_ADCin can be recovered from the value of D_ADC.
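The inverse mapping from a raw code back to a voltage can be sketched as follows, assuming the ideal transfer D = (Vin/VREF) × (2^23 − 1) with 24-bit two's-complement output (7FFFFFh at positive full-scale, 800000h at negative full-scale).

```python
# Convert a raw 24-bit two's-complement ADC code back to volts.
FS = 2 ** 23 - 1  # positive full-scale code

def adc_code_to_voltage(code, vref):
    """Voltage corresponding to a raw 24-bit two's-complement ADC code."""
    if code & 0x800000:          # sign bit set -> negative value
        code -= 1 << 24
    return code / FS * vref

print(adc_code_to_voltage(0x7FFFFF, 2.5))  # +2.5 V at positive full-scale
print(adc_code_to_voltage(0x000000, 2.5))  # 0.0 V
```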

Microcontroller Unit (MCU).
In order to configure the reference voltage, the input gain of the PGA, and the output data rate of the ADC, and to process the data obtained from the ADC output, a low-power, high-speed MCU, the STM32L151, is used in the system control module. The STM32L151 has a high-performance ARM core operating at 32 MHz, as well as high-speed embedded memories with up to 128 Kbytes of flash and up to 16 Kbytes of RAM. Moreover, it provides standard and advanced communication interfaces, such as USART and USB, which can transmit measured data to a computer for recording and further analysis. Thanks to the good performance and low cost of the STM32L151, the proposed system handles the configuration of its various parameters and the data processing well.

Power-Supply Module.
A power-supply module has been elaborated in order to satisfy the power requirement of PGA, ADC, and MCU.
The PGA281 requires three supply voltages: the high-voltage analog supply, the low-voltage output amplifier supply, and the digital I/O supply. The high-voltage analog supplies, which include the negative high-voltage supply (VSN) and the positive high-voltage supply (VSP), power the high-voltage input stage and are set to ±12 V in the proposed system. The USB voltage is 5 V, which cannot provide a ±12 V supply, so a DC-DC boost circuit based on the MC34063 converter is used. The power-supply module provides stable power, which improves the stability of the system.

Parameter Measured by the Proposed System.
The proposed system is developed as a low-cost, high-resolution, high-accuracy, real-time measurement system for the power of LoWPAN devices. It can measure and estimate the instantaneous current, average current, average power, and instantaneous power.
The instantaneous current of the DUT is measured in each sampling interval; I_DUT(n) stands for the current value in the nth sampling interval, and the average current I_average is computed once per cycle T. The input impedance of the PGA, R_PGAin, is very large and the shunt resistance R_s is very small (usually set to 10 Ω); that is, R_PGAin ≫ R_s. Therefore, the instantaneous current of the DUT can be simplified to I_DUT(n) = V_s(n)/R_s = V_ADCin(n)/(G_PGA · R_s). Similarly, the average current over the N samples of one cycle can be expressed as I_average = (1/N) Σ I_DUT(n). Assuming the battery of the DUT is ideal and its voltage U_bat is not time-varying, that is, U_bat is a constant, the average power of the DUT is calculated once per cycle T as P_average = U_bat · I_average; because R_PGAin ≫ R_s, the power consumed by the measurement path itself is negligible. The instantaneous power is the power consumption of the DUT in each sampling slot, given by P(n) = U_bat · I_DUT(n). The measured instantaneous power can be sent to a computer through a Universal Asynchronous Receiver/Transmitter (UART) for further research.
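The measurement chain described above can be sketched in a few lines. The gain, shunt, and battery values below are illustrative, and the simplified relation I = V_ADC/(G_PGA · R_s) assumes the PGA input impedance is much larger than the shunt resistance.

```python
# Minimal sketch of the current/power computation from PGA output voltages.
G_PGA = 8.0    # PGA gain (assumed)
R_S   = 10.0   # shunt resistance, ohms (typical value from the text)
U_BAT = 3.0    # battery voltage, volts (assumed constant / ideal)

def instantaneous_current(v_adc):
    """Current of the DUT in one sampling interval (amperes)."""
    return v_adc / (G_PGA * R_S)

def cycle_averages(v_adc_samples):
    """Average current and average power over one cycle of samples."""
    currents = [instantaneous_current(v) for v in v_adc_samples]
    i_avg = sum(currents) / len(currents)
    return i_avg, U_BAT * i_avg

i_avg, p_avg = cycle_averages([0.8, 0.8, 0.8, 0.8])
print(i_avg, p_avg)  # about 0.01 A and 0.03 W for 0.8 V at the ADC input
```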
In all equations, V_REF is the difference between the positive and negative reference inputs of the ADC. R_s, U_bat, and G_PGA are known parameters. D_ADC(n) is the value measured by the ADC in a sampling interval, and from D_ADC(n) we can calculate the instantaneous current, average current, average power, and instantaneous power of the DUT. The symbols used in the equations above are summarized in Table 2.
As shown in Figure 4, the DUT is a LoWPAN device which periodically builds a wireless connection with another LoWPAN device. The proposed system is connected to the DUT via an interface reserved for power measurement. The measured data are processed and transmitted to a computer for recording and further analysis via a USB cable. A battery supplies power to the DUT.

Measurement Accuracy.
We compare our measured waveforms with a Keysight 34465A multimeter. The Keysight 34465A has high sensitivity and accuracy, so it serves as the ground truth for the measured waveforms and the qualification standard for the accuracy of our measured data. The values measured by our platform are close to those of the 34465A, as shown in Figure 5(a).

We quantitatively evaluate the errors in Figure 5(b), which shows the cumulative distribution function (CDF) of the difference between PTone's measurement and the Keysight 34465A's measurement.

The CDF shows that the 90th-percentile error is 4 mW, and PTone's mean power measurement error is 0.37 mW. Since the average power in our experiments is 100 mW, on average PTone estimates power to within 0.37% of its correct value.
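Statistics of this kind can be computed directly from the per-sample differences; the error values below are made up for illustration and are not the paper's data.

```python
# Nearest-rank percentile and mean of a set of absolute errors (in mW).
def percentile(errors, p):
    """Nearest-rank p-th percentile of a list of absolute errors."""
    s = sorted(errors)
    k = max(0, int(round(p / 100.0 * len(s))) - 1)
    return s[k]

errors_mw = [0.1, 0.2, 0.3, 0.3, 0.4, 0.5, 0.9, 1.5, 2.8, 4.0]  # synthetic
mean_err = sum(errors_mw) / len(errors_mw)
print(percentile(errors_mw, 90), mean_err)
```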
This experiment confirms the performance of our system. It also demonstrates the effectiveness of the proposed system in measuring the instantaneous current, average current, instantaneous power, and average power of the DUT, as well as in checking the system status during the development of LoWPAN devices.

Power Consumption Waveforms Information.
LoWPAN devices contain a variety of components, such as a low-power MCU, RF circuits, a power management module, and a sensor module. In real-world applications, the sensor module often adopts COTS sensors, which have mature and stable power consumption models and interfere little with the detection of DUT system status. The MCU and RF circuits consume most of the power offered by the supply [22,23]. Each LoWPAN DUT uses power to maintain its system configuration and to send ACKs or data packets during the work state. After this period, the DUT system reduces the workload of the MCU, shuts down the RF circuits, and enters the sleep state, in which there is almost no energy loss.
PTone continually monitors the power waveforms of the DUT in real time, and the waveform can be represented by the following equation: P(n) = P_MCU(n) + P_RF(n) + P_noise(n), where n stands for the nth sample point. P_MCU(n) represents the power consumption of the MCU module and P_RF(n) that of the RF module; together they make up a large proportion of the total power P(n). The noise term P_noise(n) comes from random measurement noise, the power consumption of sensors and components in the peripheral circuit, and so on.
This section is organized as follows. First, we remove the noise in the measured waveform. Second, we detect the work state in the waveform which includes the start time and end time of active time. Third, we select and extract features based on the character of LoWPAN devices' power consumption. Finally, we use the extracted features to train a multiple classifier, which can detect and classify the status of LoWPAN devices accurately.

Noise Removal.
Even though the proportion of system noise in the total measured power waveform is not large, the noise cannot be ignored, because its amplitude excursions span a large dynamic range, as shown in Figure 6(a). The goal of this preprocessing step is to dampen the noise and improve the signal-to-interference-and-noise ratio (SINR) of the MCU and RF power signals. The system noise is irregular in frequency and occurs randomly, so we use an appropriate filter to minimize its influence on the subsequent feature extraction and classification.
A Butterworth low-pass filter, a signal processing filter widely used to remove high-frequency noise, would distort the power waveform under large-amplitude noise, because the noise in the waveforms is introduced by various factors, such as the test environment, circuit noise of the DUT, and the measuring platform, and is not all high-frequency. To address this challenge, PTone applies a Hampel filter to the measured power waveforms to detect and remove outliers. The Hampel filter computes the median of each sample and its surrounding samples within a sliding window. It also estimates the standard deviation of each sample about its window median using the median absolute deviation. If a sample differs from the median by more than three standard deviations, it is replaced with the median, which eliminates the outliers caused by noise [24,25].
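A compact sketch of this filtering step, each sample is compared with the median of a sliding window, and samples farther than three scaled median-absolute-deviations from that median are replaced by it; the window size is an illustrative choice.

```python
# Minimal Hampel filter over a list of samples.
def hampel(x, half_window=3, n_sigmas=3.0):
    """Return a copy of x with Hampel-detected outliers replaced."""
    k = 1.4826  # scale factor: MAD -> standard deviation for Gaussian noise
    y = list(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        window = sorted(x[lo:hi])
        med = window[len(window) // 2]
        mad = sorted(abs(v - med) for v in window)[len(window) // 2]
        if abs(x[i] - med) > n_sigmas * k * mad:
            y[i] = med   # replace the outlier with the window median
    return y

print(hampel([1.0, 1.1, 9.0, 1.0, 0.9, 1.0, 1.1]))
# -> [1.0, 1.1, 1.0, 1.0, 0.9, 1.0, 1.1]  (the 9.0 spike is removed)
```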
The result shows that Hampel filter can remove the outliers effectively under the premise of retaining the valid data.

Work State Detection.
In order to detect the work state and distinguish the active time and quiet time in it, we need to detect the start points and end points of the active time in the work state of the DUT system. We process the measured data in a causal system, which differs from the traditional mean absolute deviation (MAD) algorithm; MAD is an edge-detection algorithm widely used in digital image processing and is not suitable for the time-signal processing in this paper. To address this challenge, we propose a work state detection algorithm (WSDA). First, the algorithm calculates the weighted mean absolute value (WMAV) over the latest sample points; the weighting coefficient of each point in the window of width is set according to the distance to the current point, where represents the distance to the current point in the sliding window and ( − ) represents the filtered value. Second, the WMAV is compared with the threshold th : in intervals where it does not exceed th , the power consumption waveform does not change obviously. Third, in a real-world measurement, there may be more than one start-point candidate on the rising edge of the power waveform, so we need to choose an appropriate candidate as the start point in each work state. Because feature extraction requires the complete work-state waveform, it is inappropriate to use the mid-value of those candidates, which is commonly used in statistics; we choose the start points in the following way. Let { (1), (2), (3), . . . , ( )} represent the set of start point candidates obtained from the start point detection algorithm.
(1) represents the first start point candidate in chronological order.
represents the number of candidates in the set.
WSDA compares the Δ ( ) values of each point in { (1), (2), (3), . . . , ( )} on the same rising edge. If the mean value of Δ ( ) from point ( ) to point ( ) is larger than the threshold th , we say that point ( ) has a stably growing subsequent waveform, as expressed in the following inequality. We then choose the earliest such candidate as the start point of the active time in that work state, as shown in Figure 7. The figure shows that even though the filtered waveform has some outliers, our algorithm detects the start point accurately in different work states.
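The WSDA idea above can be illustrated with a small sketch: a weighted mean absolute value over the latest W samples, with weights growing toward the current point, and candidate start points declared where the WMAV jumps by more than a threshold. The weight formula and thresholds are assumptions, since the paper's exact expressions are not reproduced here.

```python
# Illustrative WMAV-based start-point candidate detection.
def wmav(x, i, W=4):
    """Weighted mean absolute value of the W samples ending at index i."""
    pts = x[max(0, i - W + 1): i + 1]
    weights = range(1, len(pts) + 1)          # closer samples weigh more
    return sum(w * abs(v) for w, v in zip(weights, pts)) / sum(weights)

def find_start_points(x, jump_th=0.5, W=4):
    """Indices where the WMAV rises by more than jump_th in one step."""
    starts = []
    prev = wmav(x, 0, W)
    for i in range(1, len(x)):
        cur = wmav(x, i, W)
        if cur - prev > jump_th:
            starts.append(i)
        prev = cur
    return starts

wave = [0.1, 0.1, 0.1, 0.1, 3.0, 3.1, 3.0, 0.1, 0.1]
print(find_start_points(wave))  # -> [4, 5, 6]
```

Note that several candidates appear on the same rising edge (here indices 4, 5, and 6), which is exactly the situation the candidate-selection step above resolves by keeping the earliest stable one.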

Feature Selection and Extraction.
In order to differentiate between statuses of the DUT system based on real power consumption, we need to extract unique features that can distinguish different system statuses. Using the start and end points of active time in the work state, we can extract the waveform series of the work state for further analysis instead of using the whole waveform, which allows us to extract features more accurately. Several feature selection methods have been proposed [26][27][28]; however, they need to process a large amount of data to find appropriate features, which is hard to achieve on a real-time, low-cost platform. Instead, PTone chooses features based on the unique characteristics of each active period in the work state of LoWPAN devices.
As shown in Figure 8, S1, S2, S3, and S4 represent the active times of receiving a beacon from the source node, sending an ACK, sending data, and receiving an ACK from the source node, respectively. The work state includes the S1, S2, S3, and S4 sections and the quiet time between them.
T_work represents the duration of the work state, and T represents the period of the DUT system, which includes one work state and one sleep state. We observed that T_work fluctuates and increases with the amount of transmitted data. We also observed that T is stable, as it is controlled by the MCU of the DUT system; in real-world applications, T is far greater than T_work, so a change of T_work has no great effect on T. Under the same experimental conditions, if outliers are present within T in each DUT period, that is, the measured value is either too large or too small, we usually consider that an abnormal configuration of the DUT's MCU exists in the wake-up time control module or the system clock of the MCU. Therefore, T can be selected as a feature to evaluate the MCU performance of the DUT.
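As a toy illustration of using the period as an MCU health feature, successive periods can be derived from the detected work-state start times and deviations flagged; the tolerance and the timing data below are assumptions for illustration only.

```python
# Flag DUT periods that deviate from the median period by more than rel_tol.
def period_outliers(start_times, rel_tol=0.05):
    """Return (nominal_period, list_of_outlier_periods)."""
    periods = [b - a for a, b in zip(start_times, start_times[1:])]
    t_nominal = sorted(periods)[len(periods) // 2]   # median period
    flagged = [p for p in periods if abs(p - t_nominal) > rel_tol * t_nominal]
    return t_nominal, flagged

t, bad = period_outliers([0.0, 1.0, 2.0, 3.01, 4.0, 5.6])
print(t, bad)  # nominal period 1.0 s; the ~1.6 s gap is flagged
```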
We also observed that the power waveforms in S2 and S3 have similar amplitude in each work state. If a DUT's radio frequency module does not perform well, the amplitudes of S2 and S3 may differ and be unstable. So the difference between the maximum values of S2 and S3 can be used to evaluate the status of the radio frequency module of DUT devices.
After many experiments and analysis, we carefully selected some distinctive features of LoWPAN devices to classify the system status of DUT as follows:

Waveform Features

(i) Mean amplitude of S2 and S3
(ii) Root mean square deviation of S2 and S3
(iii) Mean amplitude of S1 and S4
(iv) Quiet time in work state
(v) System period

PTone can detect and classify abnormal system statuses of LoWPAN devices based on the features proposed above. Note that we only detect abnormities that affect the performance of low-power wireless communication and that are inconvenient and hard to detect in traditional ways. The labeled features are collected in the following way. We connect a LoWPAN DUT to our hardware platform and start a source device that searches for the LoWPAN DUT and builds wireless communication. The LoWPAN device then enters work state and sleep state periodically, while the real-time power waveform sequences are measured by PTone. The work state, the start and end times of S1, S2, S3, and S4, and the sleep state in the sequences are detected by the work state detection algorithm (WSDA) proposed above. PTone then extracts the labeled features from the measured power waveform sequences based on the start and end point of each active time. Finally, the features are used to train the classifier for further analysis and classification of system status.
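For illustration, the listed features could be computed from one segmented work state as follows (a sketch; the segment dictionary, units, and key names are assumptions, not PTone's actual interface):

```python
import statistics

def extract_features(wave, segments, period_ms):
    """Compute the waveform features listed above for one work state.
    `segments` maps "S1".."S4" to (start, end) sample indices;
    `period_ms` is the measured system period."""
    s1 = wave[segments["S1"][0]:segments["S1"][1]]
    s2 = wave[segments["S2"][0]:segments["S2"][1]]
    s3 = wave[segments["S3"][0]:segments["S3"][1]]
    s4 = wave[segments["S4"][0]:segments["S4"][1]]
    return {
        "mean_s2_s3": statistics.mean(s2 + s3),    # mean amplitude of S2 and S3
        "rmsd_s2_s3": statistics.pstdev(s2 + s3),  # RMS deviation of S2 and S3
        "mean_s1_s4": statistics.mean(s1 + s4),    # mean amplitude of S1 and S4
        "quiet_time": segments["S3"][0] - segments["S2"][1],  # samples between S2 and S3
        "period": period_ms,                       # system period
    }
```

Each work state thus collapses into a small, fixed-size feature vector, which is what makes real-time classification feasible on a low-cost platform.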

Classification and Evaluation.
In this section, we investigate whether PTone can accurately classify DUT system status based on the waveform features extracted above. Based on our experience in the R&D process of LoWPAN devices, we divided system abnormal status into several groups: two kinds of RF module abnormal cases and two kinds of MCU control system abnormal cases.
RF module abnormal cases include the case of abnormal power control and the case of insufficient RF power in certain active period of work state. MCU control system abnormal cases include the case of irregular interval of each active time in work state and the case of inappropriate DUT's system period, which is caused by crystal deviation, synchronization strategy bugs, and so on.
The existence of abnormal power control, which is represented by RF-1, undermines the network connectivity and even may lead to the failure of data transmission. A typical waveform of this status is shown in Figure 9(a). The amplitude of S2 is not smooth, which should be stable during the active period to have better performance according to the design of the devices. The power waveform indicates that the power control in the S2 period during work state is abnormal. The insufficient RF power, which is represented by RF-2, may reduce the coverage of wireless network and influence the use of LoWPAN devices. A typical waveform of this status is shown in Figure 9(b). Even though the amplitudes of S2 and S3 are smooth in each active time, the amplitudes have obvious difference, which should be similar during sending packet and sending ACK period according to our design.
The insufficient RF power in S3 causes this kind of abnormal power waveform.
The unstable interval of active times, represented by M-1, may be caused by inaccurate configuration of the MCU, incomplete estimation of time delays in the DUT system, and so on. In LoWPAN, every device takes a time slot to receive a beacon and send an ACK or data packet. An unstable interval of the active time can result in communication conflicts with other devices. A typical waveform of this status is shown in Figure 10(a).
The figure shows that the interval between S2 and S3 is much larger than that of well-performing devices. S3 is repeated two times in the same work state, which means that the excessive interval between S2 and S3 leads to unnecessary retransmission of packet data (S3). This phenomenon is difficult to observe directly with traditional test methods.
Time synchronization is a key issue for any distributed system and is widely used in distributed wireless sensor networks, especially LoWPAN. If the system period is not stable and accurate, not only is power consumption increased, but the device also has a bad effect on the communication of other devices in areas with high node density. The case of abnormal system period is represented by M-2. A typical waveform of this status is shown in Figure 10(b). The figure shows that a system in M-2 does not have a proper sleep state between work states; the device receives beacons and sends data all the time, which is not in accordance with the design of the devices.
The extracted features are used to train Support Vector Machine (SVM) and achieve the classification of multiple system status. We use maximum posterior probability (MAP) obtained by the SVM to estimate the system status. MAP can be used to obtain a point estimate of an unobserved quantity based on measured data. It is closely related to Fisher's method of maximum likelihood (ML) estimation but employs an augmented optimization objective which incorporates a prior distribution (that quantifies the additional information available through prior knowledge of a related event) over the quantity one wants to estimate [29][30][31][32].
We visualize the output of the classification as follows: we choose two of the features and calculate the maximum posterior probability (MAP) of each status in a 2D plane. The axes are the two features or combinations of the two features.
In Figure 11(a), the MAPs of the M-1 and M-2 statuses are estimated on the basis of two features, namely, the quiet time between S2 and S3 and the system period. Due to the limit of using only two features, the SVM can only classify the system status into 3 kinds of cases: M-1, M-2, and uncertain status. The uncertain status includes RF-1, RF-2, normal status, and other unknown statuses. Every point in Figure 11(a) has a posterior probability for each system status, and the color of a point is the MAP value of a system with those two feature values. If the MAP of the system in a certain status is larger than in the others, we classify the system into that status.
For example, if a DUT's quiet time between S2 and S3 is 60 ms, and system period is 700 ms, we can classify that the system status is in M-2 status, because the MAP of that point in M-2 status is more than 90%.
We also visualize the output of the classification by calculating the MAP based on different features to estimate system status. We classify RF-1 versus RF-2 (Figure 11(b)) and M-2 versus RF-1 (Figure 11(c)) by using the mean value of S2, the quiet time between S2 and S3, and the system period.
Note that we can accurately classify two system statuses by using only two features.
In the actual course of application, we use more than two features to train the multiple classifier, which can detect different system status at the same time.
Our method in real-world application is composed of an offline step and an online step. In the offline step, we collect waveform data of DUTs in known statuses, extract the features, and train the classifier. In the online step, we connect the DUT (in unknown status) into our platform. Then, the real-time power waveform of the DUT is transmitted to a server, which is a Lenovo laptop with a Core i5 desktop CPU and 4 GB RAM. The server calculates the MAP of each status based on the well-trained classifier and classifies the real-time system status according to the following rules. Let P_N, P_M-1, P_M-2, P_RF-1, and P_RF-2 represent the MAP of normal, M-1, M-2, RF-1, and RF-2 status, respectively. If the difference between the two largest of these values is more than a certain threshold θ_MAP, the system is assigned to the status with the largest MAP. Otherwise, PTone classifies the system into uncertain status and waits for further data to recalculate the MAP of each status. The threshold θ_MAP depends on the number of detected statuses and the required resolution for distinguishing different statuses. A larger θ_MAP reduces the probability of wrong classification, but, at the same time, it increases the amount of data collected from the DUT and the time spent analyzing the data. According to our measurements, θ_MAP is set to 20% in our experiments.
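The decision rule can be sketched as follows (function and variable names are illustrative):

```python
def classify_status(map_scores, theta=0.20):
    """Assign the status with the largest MAP only if it exceeds the
    runner-up by more than the threshold `theta`; otherwise report
    'uncertain' so that further data can be collected."""
    ranked = sorted(map_scores.items(), key=lambda kv: kv[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best[1] - runner_up[1] > theta:
        return best[0]
    return "uncertain"
```

With the 20% threshold used in the experiments, a MAP vector such as {normal: 85%, RF-2: 7%, ...} is decisively classified as normal, while two statuses whose MAPs lie within 20% of each other yield "uncertain".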
To evaluate PTone's system status classification accuracy, we collect 200 power waveform sequences (of the same length) from the different statuses (40 sequences per status). We train a multiclass system status classifier based on the features proposed and extracted above. The average accuracy of classification and some typical multiclass classifications of system status are shown in Figure 12.
In Figure 12(a), the figure shows that the average accuracy of multiclass classification is 89.5%. The classification of the abnormal system period status (M-2) has higher accuracy (95%) because the sleep time between work states, the main feature of M-2, is much longer than the work state and can easily be distinguished. Meanwhile, the classification accuracy of RF-2 is relatively low (82.5%), because the amplitude difference between S2 and S3 in work states, one of the main features of RF-2, is usually very small and easily affected by noise. Even though there is some noise in the measured sequences and the performance of PTone's server is limited, our platform still achieves an accuracy of 89.5%.
In Figure 12(b), the figure shows that the multiclass classifier can calculate the MAP of each status. In the 1st measurement, the MAP of the system in normal status is 85%, which is higher than that of the M-1 (2%), M-2 (3%), RF-1 (3%), and RF-2 (7%) statuses. According to the classification rule proposed above, the difference between the largest MAP (normal, 85%) and the second largest (RF-2, 7%) exceeds the threshold, so we can classify the DUT system in the 1st measurement into normal status. Similarly, the DUT systems in the 2nd, 3rd, 4th, and 5th measurements can be classified into M-1, M-2, RF-1, and RF-2 status, respectively.
The results of the experiment show that PTone can detect and classify typical system statuses at a high classification rate by using different features extracted from power waveforms.

Conclusions
In this paper, a low-cost, instantaneous current/power waveform detection platform with desirable resolution and accuracy is proposed. Power-related quantities, for example, average current, instantaneous current, average power, and instantaneous power, can be measured and calculated precisely. A novel and convenient testing method based on this system has been demonstrated for LoWPAN devices. The platform can be built as a hermetically sealed module from economical components in a small volume, which can be used in a variety of scenarios wherever a wide range of small DC currents is to be measured.

Introduction
We are experiencing explosive growth in digital data and connected devices today. The number of devices connected to the Internet is predicted to reach approximately 6.4 billion in 2016 and 20.8 billion by the year 2020, according to Gartner. Since the inception of the term Internet of Things (IoT), originally intended for radio frequency identification (RFID) networks, in 1999 by the Auto-ID Center at MIT [1], the concept has evolved into a network that is exponentially larger and more complex, supporting a large number of devices with services such as automotive, utilities, logistics, healthcare, and public safety, using various technologies such as wireless sensor networks, Wi-Fi, ZigBee, and cellular networks to connect to the Internet backbone.
To realize the IoT revolution that will enable new solutions and business for consumers and entrepreneurs by connecting billions of physical world devices with varying capabilities, many major standards development organizations (SDOs) such as ITU-T, IETF, IEEE, and 3GPP have been very active on IoT standardization. In 2012, Recommendation ITU-T Y.2060, "Overview of the Internet of Things" was approved by ITU-T Study Group 13 (SG 13) that provides an IoT reference model and defines IoT as a global infrastructure for the information society, enabling advanced services by interconnecting things based on existing and evolving interoperable information and communications technologies [2]. IETF is actively working on a set of IoT standards focusing on IP transmission over NFC, Bluetooth, and WPAN [1]. IEEE P2413 WG is working on IoT architectural framework standardization that includes IoT domains, abstractions, and commonalities [1]. Finally, 3GPP is working on a narrowband evolution of LTE for IoT (NB-IoT) to be included in Rel. 13 that will further reduce cost and power but provide extended coverage compared to the LTE-M introduced in Rel. 12. Despite the different views on the IoT standardization by different SDOs, there is a general consensus on the technical challenges the IoT needs to solve. These challenges are to provide heterogeneous connectivity, ubiquitous coverage, reduced network and device complexity, enhanced power savings, and enhanced resource management. All these challenges are heavily impacted by the IoT network topology supported by massive number of connected devices.
Network science is a branch of science of interdisciplinary nature based on graph theory, statistical mechanics, data mining, and more [3]. Network science utilizes various models and tools to study the topology of complex networks such as brain networks, social networks, and wireless networks. For decades, the random network or Erdös-Rényi network has been the main network model for studying real-world complex networks. A random network of N nodes is constructed by connecting each of the N(N − 1)/2 node pairs with probability p. The "small-world effect" or "six degrees of separation" principle, first discovered by the social psychologist Milgram [4], was first studied systematically by Watts and Strogatz in [5]. In small-world networks, randomly reconnecting a small number of links in a regular lattice network reduces the average path length significantly [6][7][8][9][10][11][12]. Both the random network and the small-world network have homogeneous network topology where the nodes have approximately the same number of links. In recent years, scientists have discovered that many real-world networks, such as the World Wide Web (WWW), social networks, and metabolic networks, are not random with node degree distribution approximated by a Poisson distribution but instead follow a power-law distribution [13][14][15][16][17][18][19][20][21]. In contrast to the Poisson distribution, the power-law distribution has a higher peak and a "fat" tail describing the existence of a few nodes with massive numbers of links observed in real networks. In 1999, Barabási and Albert proposed the scale-free network, whose degree distribution follows a power law. The two main features of the scale-free network are that it is an evolving network with incoming nodes and that these nodes attach preferentially to the existing nodes with a large number of links.
Due to interdisciplinary nature of the network science, with heavy emphasis on graph theory, it is not easy to study the various tools provided by complex network models. There are numerous small-world and scale-free network introductory research papers from statistical mechanics branch, but there are very few introductory research papers from information and communications technology branch that can be used for wireless network optimization. Therefore, in this paper, we attempt to introduce basic concepts of graph theory, including small-world networks and scale-free networks, and provide system models with wireless channel characteristics that can be easily implemented to be used as a powerful tool in solving various research problems related to IoT. Furthermore, we evaluate the proposed system models of small-world networks and scale-free networks based on various complex network metrics, such as average path length and clustering coefficient.
The remainder of the paper is organized as follows. Section 2 presents recent research activities related to applications of small-world and scale-free concepts to IoT networks. Section 3 provides an overview on basic concepts related to complex networks. In Section 4, we describe the proposed system model for small-world networks and system model for scale-free network in Section 5. In Section 6, we present various network characteristics that are needed to be added to the conventional complex network models for IoT modeling. The numerical results are presented in Section 7. Finally, we conclude in Section 8.

Related Works
Small-world network and scale-free network models have been applied to various wireless networks, serving as different bases of the IoT platform, to solve various problems. We briefly describe recent research activities devoted to improving wireless network performance based on the small-world and scale-free concepts. One of the first known works applying the small-world concept to wireless network research is the work done in [22] by Helmy. It was shown that the path length of wireless networks can be drastically reduced by rewiring a small number of links between the wireless nodes. Furthermore, a novel resource discovery method was proposed for small-world based wireless networks. The authors in [23] proposed to improve synchronization in wireless sensor networks by applying small-world concepts. The proposed model utilizes a small number of high-end sensors (H-sensors) with long communication range in addition to a large number of low-end sensors (L-sensors) in the network. Using a modified flooding time synchronization protocol in the proposed heterogeneous sensor network, a small-world topology is realized. Experimental results have shown that the synchronization error can be reduced by more than 50% compared to conventional methods. Another important application of the small-world concept is in improving the energy efficiency of wireless networks. In designing an ad hoc network with improved energy efficiency, the authors in [24] used a new energy efficiency metric of the wireless nodes for the proposed small-world network. The small-world properties were verified through simulation analysis, and the proposed model was shown to be more energy efficient than the conventional random network. One of the advantages of scale-free networks is robustness against node failures. In [25], the authors proposed two protocols for constructing scale-free networks.
The two protocols are called preferential attachment with time varying feedback and opportunistic two-stage random attachment. Based on degree distribution analysis, it was shown that the proposed network model is scale-free. Authors in [26] proposed a new scale-free network that includes node removal process with compensation mechanism compared to the conventional Barabási and Albert (BA) network that only considers node addition step for node evolution process. The new model is called neighborhood log-on and log-off (NLL) model due to the similar mechanism as in the action of logging on and logging off a system. It was shown that the proposed NLL model achieves decreased average path length and enhanced network connectivity compared to the BA model. In [27], a new scale-free network called flow-aware scale-free (FASF) model is proposed. In this work, the wireless sensor network is modeled as a weighted network and the traffic of the nodes is modeled as the weight. The performance of the proposed model was compared to the BA and NLL models and was shown to achieve increased network lifetime and reduced average path length. Authors in [28] proposed a new shortcut strategy based on local importance of nodes. The proposed method was evaluated in a sensor network with regular sensor nodes and super sensor nodes and was shown to have small-world features. In [29], small-world network models were applied to mobile ad hoc network (MANET) and were implemented based on routing protocols in OPNET. The distance vector based routing protocols and link-state based routing protocols were compared and it was shown that link-state based protocols converge faster. Authors in [30] proposed an energy aware BA model for wireless sensor networks (WSN). The key idea of the proposed method is to consider both node degree and residual energy in preferential attachment. 
In [31], the authors proposed two protocols to construct energy-efficient and robust large-scale WSN with scale-free properties. The first scheme modifies the BA model by integrating clustering and degree constraint. The second scheme improves energy efficiency through avoiding links to hub-nodes with large potential degrees. The performance of the proposed protocols was verified based on scale-free property, average path length, energy efficiency, and network robustness.

Basic Concepts
Complex networks such as computer networks, sensor networks, brain networks, and social networks can be represented as a graph. A graph consists of vertices (nodes, points) connected by edges (links, lines). Mathematically, a graph can be represented by the ordered pair of sets G = {V, E}, where V is a set of vertices and E is a set of edges connecting elements of V. Furthermore, the degree of a vertex is defined as the total number of edges connected to that vertex. Figure 1 shows a graph consisting of 7 vertices and 7 edges. The degree of node 4 is k_4 = 3, since there are 3 edges connected to nodes 2, 3, and 5. The graph in Figure 1 can be described as a network with a set of vertices V = {1, 2, 3, 4, 5, 6, 7} and a set of edges E = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7)}.
An important model of complex networks is the random network or Erdös-Rényi network model. The random network model is commonly used as a reference against which newly proposed network models are compared. A random network can be characterized by the total number of nodes N and the probability p that two nodes are connected. A random network can be constructed by the following procedure with N nodes and connection probability p.
Step 1. Start with N isolated nodes.

Step 2. Select a pair of nodes i and j.

Step 3. If a randomly generated number between 0 and 1 is less than p, then connect the two nodes i and j.

Step 4. Repeat Steps 2 and 3 for all N(N − 1)/2 node pairs.
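The construction can be sketched as follows (a minimal adjacency-matrix implementation using Python's standard `random` module; names are illustrative):

```python
import random

def random_network(n, p, seed=0):
    """Erdos-Renyi construction: visit each of the n(n-1)/2 node
    pairs and connect the pair with probability p."""
    rng = random.Random(seed)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:          # connect if random draw < p
                adj[i][j] = adj[j][i] = 1
    return adj
```

Setting p = 1 yields the complete graph and p = 0 the empty graph; intermediate values give the Poisson-like degree distribution discussed below.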
To evaluate the size of a network, the average path length L is used and is defined as the average distance between two vertices, averaged over all possible pairs of vertices. The distance between a pair of vertices is defined to be the minimum number of edges or hops connecting the two vertices. For example, the distance between vertices 1 and 5 in the network in Figure 1 is equal to 3. The average path length is determined as follows:

L = (1 / (N(N − 1)/2)) Σ_{i>j} d_ij, (1)

where N is the total number of vertices in the network, d_ij is the minimum distance between vertices i and j, and N(N − 1)/2 represents the number of all possible pairs of vertices. Figure 2 shows a ring of vertices with N = 4 and k = 3. Using (1), the average distance is calculated as L = 6^(−1) × (1 + 1 + 1 + 1 + 1 + 1) = 1.

The clustering coefficient is defined as the average fraction of pairs of neighbor vertices that are also neighbors of each other. The clustering coefficient measures the cliquishness of a typical friendship circle. The average clustering coefficient over all vertices i = 1, ..., N is given by

C = (1/N) Σ_{i=1}^{N} C_i, (2)

where C_i is the clustering coefficient for vertex i defined as

C_i = E_i / (k_i(k_i − 1)/2), (3)

where E_i is the actual number of edges connecting the neighbors of vertex i, k_i is the total number of neighbor vertices connected to vertex i, and k_i(k_i − 1)/2 is the maximum number of possible connections between the neighbor vertices. Therefore, C_i represents the ratio of actual neighbor-node connections to the maximum possible connections. For the network shown in Figure 2, for vertex 1, E_1 = 3 and k_1 = 3, and the maximum number of connections between the neighbor vertices is equal to 3(3 − 1)/2 = 3. Thus, C_1 = 3/3 = 1.
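The average path length and clustering coefficient definitions above translate directly into code; the sketch below computes distances by breadth-first search over an adjacency matrix (names are illustrative; it assumes a connected graph):

```python
from collections import deque

def average_path_length(adj):
    """Mean hop distance over all node pairs, via BFS from each node.
    Assumes the graph described by adjacency matrix `adj` is connected."""
    n = len(adj)
    total, pairs = 0, 0
    for s in range(n):
        dist = [-1] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t in range(s + 1, n):   # count each unordered pair once
            total += dist[t]
            pairs += 1
    return total / pairs

def clustering_coefficient(adj):
    """Average over nodes of (edges among neighbors) / (k(k-1)/2)."""
    n = len(adj)
    coeffs = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(adj[u][v] for a, u in enumerate(nbrs) for v in nbrs[a + 1:])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / n
```

For the fully connected 4-node example above (N = 4, k = 3), both metrics evaluate to 1, matching the hand calculation.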
The degree distribution P(k) is another important metric used for network topology analysis. It is defined as the probability that a randomly chosen node has degree k. A network's degree distribution depends on the total number of vertices N in the network and the node connection probability p. A random graph with large N has a degree distribution that can be approximated by the Poisson distribution:

P(k) = e^(−⟨k⟩) ⟨k⟩^k / k!, (4)

where k is the degree and ⟨k⟩ is the average degree. Figure 3 shows the degree distribution for a random network with N = 1000, p = 0.01, and ⟨k⟩ = 10. It is seen from the figure that the shape of the distribution has a peak around ⟨k⟩ = 10 and falls off exponentially on both sides. In contrast to the degree distribution of random networks and small-world networks following the Poisson distribution, scale-free networks have a degree distribution that follows the power-law distribution:

P(k) ∝ k^(−γ), (5)

where k is the degree and γ is the scale-free exponent. Many real networks have the scale-free property with power-law distribution. For example, the Internet, actor casting, and paper citation networks have power-law distributions with γ = 2.1, γ = 2.3, and γ = 3, respectively.

Small-World Network
In this section, we discuss the design and implementation issues related to modeling small-world networks through algorithm description, system model presentation, and metric calculation module overview.

4.1. Algorithm.
The small-world model has been actively applied to communications network research because the resulting network topology features a smaller average transmission delay and more robust network connectivity. The small-world network is constructed by randomly rewiring the edges of a ring lattice with N nodes. The following procedure describes the basic steps of small-world network construction. By varying the rewiring probability p over 0 ≤ p ≤ 1, one can analyze the transition of the network from a lattice structure to a random structure.
Step 1. Start with a ring of N nodes.

Step 2. Connect each node to its K nearest neighbors.

Step 3. Reconnect an edge to a randomly chosen node with probability p.

Step 4. Repeat Step 3 for all NK/2 edges in the ring network.
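The four steps can be sketched as follows (a minimal Watts-Strogatz style implementation; it assumes K is even and stores the graph as an adjacency matrix — names are illustrative):

```python
import random

def small_world_network(n, k, p, seed=0):
    """Ring lattice of n nodes with k neighbors each, then each
    lattice edge is rewired to a random new target with probability p."""
    rng = random.Random(seed)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):                      # Steps 1-2: ring lattice,
        for d in range(1, k // 2 + 1):      # k/2 neighbors on each side
            j = (i + d) % n
            adj[i][j] = adj[j][i] = 1
    for i in range(n):                      # Steps 3-4: rewire each
        for d in range(1, k // 2 + 1):      # forward edge with prob. p
            j = (i + d) % n
            if rng.random() < p and adj[i][j]:
                choices = [t for t in range(n) if t != i and not adj[i][t]]
                if choices:
                    t = rng.choice(choices)
                    adj[i][j] = adj[j][i] = 0
                    adj[i][t] = adj[t][i] = 1
    return adj
```

Rewiring swaps one edge for another, so the total edge count NK/2 is preserved for any p; p = 0 leaves the regular lattice intact, and p = 1 approaches a random structure.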

System Model.
Figure 4 shows the system model for implementing a small-world network. The system contains five major blocks: node initialization, node connection, rewiring, average path length calculation, and clustering coefficient calculation. Parameters N, K, and p correspond to the total number of nodes, the initial degree of all the nodes, and the rewiring probability, respectively. Furthermore, the node connection matrix or adjacency matrix gives information about all the node connections after completion of the rewiring process. The node connection matrix shown in equation (6) describes the node connections for the graph example in Figure 1.
Note that the small-world network is modeled by a relational graph where the distance is based on edges or hops rather than the absolute distance used in spatial graphs.

Metric Calculation Modules.
In the average path length calculation block, based on the node connection matrix, which contains all the connections between the nodes, the number of hops needed to reach node j from node i is calculated. The first step in this module is to find all possible node pair indices (i, j). For each node pair, we start by checking whether there is a direct connection between nodes i and j. If there is no direct connection, we check for a 2-hop connection where node j is reached through an intermediate node. We continue this process with an increasing number of hops until the number of hops for every node pair has been found. Finally, the hop counts of all the node pairs are added and divided by the total number of node pairs. For a network with a very large number of nodes, the breadth-first search (BFS) algorithm [32] is recommended for average path length calculation. The basic idea of the BFS algorithm is to label a reference node as "0" and then "ripple" the labeling process outward until all the nodes have been labeled. The labels provide the distance from node 0. In the clustering coefficient calculation block, based on the node connection matrix, the total number k_i of neighbor nodes connected to node i is found. Using this number, the maximum number of possible connections is calculated as k_i(k_i − 1)/2. The next step is to find the actual number of edges connecting the neighbors of node i. We continue this process for all nodes. Finally, we use (3) to calculate the clustering coefficient for node i using the information found in the previous steps and obtain the final clustering coefficient using (2).

Scale-Free Network
In this section, we discuss the design and implementation issues related to modeling scale-free networks through algorithm description, system model presentation, and metric calculation module overview.

Algorithm.
Major features of scale-free networks that distinguish them from random networks and small-world networks are the dynamic addition of new nodes and preferential attachment to existing nodes with rich connections. Due to these features, in contrast to random networks and small-world networks with Poisson degree distributions, scale-free networks have a degree distribution following a power law, resulting in a higher probability of finding nodes with a large number of links. The following algorithm shows the steps for constructing a scale-free network.

Figure 5: Scale-free network system model.
Step 1. Start with a small number m_0 of nodes, each with degree m.

Step 2. Introduce a new node into the network.

Step 3. Connect the new node to m existing nodes based on the preferential attachment probability

Π(k_i) = k_i / Σ_j k_j, (7)

where Π(k_i) is the probability of selecting node i, k_i is the degree of node i, and Σ_j k_j is the sum of degrees of all nodes in the current network (twice the total number of edges).

Step 4. Repeat Steps 2 and 3 until a network with N = t + m_0 nodes and mt new edges has been constructed, where t is the number of added nodes.

System Model.
The system model for constructing a scale-free network is shown in Figure 5. The main modules are the initialization modules, node connection modules, and metric calculation modules. The system model starts by initially connecting m_0 nodes. The node connection matrix is generated based on the initial network topology. Based on the node connection matrix, an edge vector is created. We implement the preferential attachment step (Step 3 of the scale-free network algorithm) by using the edge vector. The edge vector contains multiple node indices, where the number of repetitions of an index equals the number of edges of that node. The edge vector based method is further described based on Figure 6 and equations (8), (9), (10), and (11).
In Figure 6, an initial graph is shown with 3 vertices and 2 edges per vertex. Equations (8), (9), (10), and (11) show the connection matrix with the connection information of the initial graph topology. Furthermore, the edge vector includes the node indices of all the nodes in the graph in multiples of two, since every node has 2 links connected to its neighbor nodes. To implement the degree-proportional node selection with m = 2 target nodes, two random numbers between 1 and the total number of elements in the edge vector, which is 6, are generated. Let us assume that the numbers generated are 3 and 6. Since the 3rd and 6th elements of the edge vector are node indices 2 and 3, the new node 4 will connect to the existing nodes 2 and 3, as shown in Figure 6. The updated connection matrix and edge vector due to the new node connection are shown in equations (8), (9), (10), and (11). From the updated edge vector, we can see that the number of occurrences of node indices 2 and 3 has increased from 2 to 3. Thus, nodes 2 and 3 have a higher probability of being selected by the next new node 5 compared to nodes 1 and 4. The dynamic addition of a new node and the preferential attachment process are implemented in three blocks: new node generation, target node selection, and connection matrix/edge vector update, as shown in Figure 5. When the construction of a network with N = t + m_0 nodes has been completed, the final node connection matrix is used in the metric calculation blocks: the average path length block and the degree distribution block.
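The edge-vector method above can be sketched as follows (names are illustrative; it assumes m_0 ≥ 3 so the initial ring is a simple graph, and m ≤ m_0). A node's index appears in the vector once per incident edge, so drawing uniformly from the vector selects targets with probability proportional to their degree, as required by the preferential attachment step:

```python
import random

def scale_free_network(m0, m, t, seed=0):
    """Barabasi-Albert style construction via an edge vector:
    each node index appears in `edge_vector` once per incident edge,
    so a uniform draw implements degree-proportional selection."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(m0)}
    edge_vector = []
    for i in range(m0):                      # initial ring of m0 nodes
        j = (i + 1) % m0
        adj[i].add(j); adj[j].add(i)
        edge_vector += [i, j]
    for new in range(m0, m0 + t):            # add t new nodes
        targets = set()
        while len(targets) < m:
            cand = rng.choice(edge_vector)   # degree-proportional pick
            if cand != new:
                targets.add(cand)
        adj[new] = set()
        for tgt in targets:
            adj[new].add(tgt); adj[tgt].add(new)
            edge_vector += [new, tgt]        # update the edge vector
    return adj
```

Because every new edge appends both endpoints to the vector, high-degree nodes accumulate more entries and attract subsequent nodes, producing the power-law tail.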

Metric Calculation Modules.
In the average path length calculation block, after completion of the scale-free network construction, the node connection matrix that contains the link information between all the nodes is used for the average path length calculation. The steps used in the average path length calculation block of the small-world network are reused in the scale-free network system model. Furthermore, the BFS algorithm is recommended for the average path length calculation of scale-free networks with a large number of nodes and edges.
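A BFS-based average path length routine of the kind recommended above can be sketched as follows (the dict-of-sets adjacency is our own stand-in for the connection matrix):

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path hop count over all reachable node pairs,
    computed by one breadth-first search per source node."""
    nodes = list(adj)
    total, pairs = 0, 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:                      # standard BFS from source s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for t in nodes:
            if t != s and t in dist:
                total += dist[t]
                pairs += 1
    return total / pairs

# 4-node ring: each node sees two neighbors at 1 hop, one node at 2 hops.
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
apl = average_path_length(ring)       # (1 + 1 + 2) / 3 = 4/3 per source
```

Since one BFS runs in O(N + E) time, the all-pairs sweep costs O(N·(N + E)), which is why BFS scales to the large networks mentioned above.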
Degree distribution calculation is important for studying the power-law nature of the constructed scale-free network. In the degree distribution calculation block, the first step is to gather the degree of every node in the network using the node connection matrix. The range of degrees is defined based on the minimum and maximum degree values found in the first step. The next step is to bin the degree data over the full degree range. In the last step, the power-law nature of the network is verified by plotting the binned degree data on a log-log scale.
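The first two steps (gathering and binning degrees) can be sketched as below; the log-log plotting step (e.g., with matplotlib) is omitted, and the adjacency structure is again our stand-in for the connection matrix:

```python
from collections import Counter

def degree_distribution(adj):
    """Fraction of nodes having each degree, binned per integer degree."""
    counts = Counter(len(neigh) for neigh in adj.values())
    n = len(adj)
    return {d: c / n for d, c in sorted(counts.items())}

# 5-node star: one hub of degree 4, four leaves of degree 1.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
dist = degree_distribution(star)
```

For a power-law check, the returned dictionary would be plotted with both axes on a logarithmic scale; a scale-free network appears as a roughly straight descending line.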

IoT Model
In contrast to the relational graphs used in general complex network models, spatial graphs are more appropriate models for wireless sensor networks (WSNs). This is because in a WSN, due to practical constraints such as energy capacity and radio transmission range, the links are restricted by the distance between nodes, rather than by relational factors as in small-world networks and scale-free networks. Thus, a new incoming node will have a limited number of candidate target nodes to connect to, subject to the power required to communicate between nodes, which can be simply modeled as P = d^α · P_min, where d is the distance between the nodes, α is the path loss exponent, and P_min is the minimum power for acceptable reception quality. Furthermore, in a heterogeneous WSN, the nodes will have different communication and energy capabilities. Possible WSN nodes are a sink node, a small number of cluster head nodes with high hardware capabilities and energy capacity, and a large number of low-cost sensor nodes that are continuously added to the network over time.
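The distance-based power constraint can be illustrated with a short sketch. Here the power budget p_max, the node positions, and the function name are illustrative assumptions, not part of the paper's model:

```python
def candidate_targets(new_pos, positions, p_max, p_min, alpha):
    """Existing nodes a new node can reach within its transmit-power
    budget, using the distance-based power model p_min * d**alpha."""
    cands = []
    for node, (x, y) in positions.items():
        d = ((new_pos[0] - x) ** 2 + (new_pos[1] - y) ** 2) ** 0.5
        if p_min * d ** alpha <= p_max:   # required power within budget
            cands.append(node)
    return cands

# With alpha = 2 and p_max / p_min = 16, only nodes within distance 4 qualify.
nodes = {1: (1.0, 0.0), 2: (3.0, 0.0), 3: (10.0, 0.0)}
reachable = candidate_targets((0.0, 0.0), nodes,
                              p_max=16.0, p_min=1.0, alpha=2.0)
```

This is the spatial restriction that distinguishes WSN graphs from purely relational small-world or scale-free graphs: the candidate set shrinks as α grows or the power budget falls.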
Important optimization criteria of a WSN are energy efficiency, average path length based on geographical distance, and network tolerance. Based on these WSN optimization criteria, the rewiring scheme in the conventional small-world network algorithm is modified to include performance metrics such as energy efficiency, as shown below.
Step 1. Start with a ring of N nodes, each connected to its k nearest neighbors.
Step 3. Reconnect or add new edges based on a probability function that depends on the performance optimization parameters.
Step 4. Repeat Step 3 for all Nk/2 edges in the ring network.
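As a concrete baseline, these steps reduce to the conventional Watts–Strogatz construction when the probability function is a constant rewiring probability p. A minimal sketch under that assumption (parameter names are ours; the performance-dependent probability would replace the uniform test):

```python
import random

def watts_strogatz(n, k, p, rng=random):
    """Ring of n nodes, each linked to its k nearest neighbors, with each
    ring edge rewired to a random node with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):                      # Step 1: regular ring lattice
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    for i in range(n):                      # Steps 3-4: visit every ring edge
        for j in range(1, k // 2 + 1):
            if rng.random() < p:
                old = (i + j) % n
                new = rng.randrange(n)
                if new != i and new not in adj[i]:  # avoid self-loop/duplicate
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj

net = watts_strogatz(24, 4, 0.1)  # matches the N = 24, k = 4 experiment
```

Each rewiring removes one edge and adds one, so the total edge count Nk/2 is preserved regardless of p.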
As for the WSN based on scale-free properties, the preferential attachment step in the conventional scale-free network algorithm is modified to include performance metrics such as energy efficiency, as shown below.
Step 1. Start with a small number m₀ of nodes with degree m.
Step 2. Introduce a new node into the network.
Step 3. Connect the new node to existing nodes with a probability chosen to maximize the performance optimization criteria.
Step 4. Repeat Steps 2 and 3 until a network with N = t + m₀ nodes and mt edges has been constructed.
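The growth loop of Steps 2–4 can be sketched with the edge-vector mechanism described earlier. This sketch uses plain degree-proportional attachment; a performance-optimizing variant would reweight the target choice, and the function and parameter names are ours:

```python
import random

def scale_free(n, m0, m, rng=random):
    """Grow a network to n nodes by preferential attachment via an edge
    vector that repeats each node index once per incident edge."""
    adj = {i: set() for i in range(m0)}
    for i in range(m0):                 # Step 1: fully connect the seed so
        for j in range(i + 1, m0):      # the edge vector starts non-empty
            adj[i].add(j)
            adj[j].add(i)
    edge_vector = [i for i in range(m0) for _ in range(len(adj[i]))]
    for new in range(m0, n):            # Steps 2-4: add one node per step
        targets = set()
        while len(targets) < m:         # degree-proportional sampling
            targets.add(rng.choice(edge_vector))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            edge_vector.extend([new, t])
    return adj

net = scale_free(500, 3, 3)  # matches the N = 500, m0 = m = 3 experiment
```

Each added node contributes exactly m edges, so after t additions the network holds the m₀-clique edges plus mt growth edges.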

Small-World Network.
In this section, we study the behavior of the small-world network implemented based on the system architecture with metric calculation modules described in Section 4. We initially assumed a regular ring lattice model with N = 24 nodes and initial degree k = 4 for all the nodes. The small-world network was created according to the system architecture described previously with the rewiring probability p ranging from 0 to 1. Figure 7 shows the average path length of the implemented small-world network. One can observe that the average path length is around 4.2 for p = 0 (without rewiring) and decreases to 2.6 for high rewiring probability (random network). Even with a small amount of random rewiring, there is a drastic decrease in the average path length. Note that the theoretical average path length for the random network (p = 1) can be calculated as L ∼ ln(N)/ln(k). Figure 8 shows the clustering coefficient of the implemented small-world network. One can observe that the clustering coefficient remains relatively constant at around 0.5, but drops rapidly for rewiring probabilities greater than 0.1. Thus, the small-world network remains highly clustered, like the regular lattice, for p less than 0.1. From Figures 7 and 8, we can conclude that the small-world behavior was fully confirmed based on the proposed system architecture: the network is highly clustered like the regular lattice while having a small average path length like random graphs.
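The clustering coefficient metric used in this experiment can be sketched as follows (a minimal implementation of the standard average local clustering coefficient, on our dict-of-sets adjacency stand-in):

```python
def clustering_coefficient(adj):
    """Average local clustering coefficient: for each node, the fraction of
    its neighbor pairs that are themselves connected, averaged over nodes."""
    total = 0.0
    for node, neighbors in adj.items():
        neigh = list(neighbors)
        k = len(neigh)
        if k < 2:
            continue                      # undefined for degree < 2; count as 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if neigh[j] in adj[neigh[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# 24-node ring lattice with degree 4, as in the experiment above: of each
# node's 6 neighbor pairs, exactly 3 are connected.
ring = {i: {(i + d) % 24 for d in (-2, -1, 1, 2)} for i in range(24)}
print(clustering_coefficient(ring))  # prints 0.5 for this regular lattice
```

The value 0.5 for the unrewired lattice matches the plateau reported in Figure 8 before rewiring destroys the local triangles.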

Scale-Free Network.
In this section, we study the behavior of the scale-free network implemented based on the system architecture described in Section 5. Figure 9 compares the average path length for random networks, denoted as RN in the plot, and scale-free networks, denoted as SFN, for different numbers of nodes in the network with average degree ⟨k⟩ = 4. The random network was constructed based on the algorithm described in Section 3. As for the scale-free network, the initial number of nodes was set to m₀ = 2 and the number of target nodes for preferential attachment by an incoming node was set to m = m₀ = 2. To calculate the average path length, the BFS algorithm was utilized. As shown in Figure 9, with increasing network size N, a substantially smaller average path length is observed in the scale-free network compared to the random network. Figure 10 shows the degree distribution of a scale-free network with a total of N = 500 nodes for different numbers of initial nodes and target nodes, set as m₀ = m = 3, 5, and 7. The degree distribution was calculated using the degree distribution calculation block described in Section 5. As seen in Figure 10, the degree distribution generated by the proposed system model follows the power-law distribution for all values of m₀ and m, proving that the generated network evolves into a scale-free network. Note that the noise in the tail occurs due to the limited amount of data available to average out the noise. One of the advantages of the scale-free network is its error and attack tolerance. Figure 11 shows the error tolerance performance as a function of the number of nodes removed from the network. To study the error tolerance performance of the generated scale-free network, out of N = 1000 nodes, randomly selected nodes were removed together with all of their connected edges. The average path length metric was used to study the disruption to the scale-free network and the random network due to the removal of nodes.
We can see that when 20% of the nodes in the network are removed, the average path length of the network increases by around 16% and 12% in the random network and the scale-free network, respectively. Note that a peak point can be observed in the figure. This point is called the critical point, where the network breaks into numerous isolated clusters, resulting in a rapid drop in the average path length. From the small average path length, power-law degree distribution, and error tolerance performance shown in Figures 9, 10, and 11, we can conclude that the generated network fully satisfies the scale-free characteristics.
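The random-failure experiment can be sketched as a node-removal routine; recomputing the average path length over the surviving graph then quantifies the disruption. The function name and the complete-graph demo are illustrative assumptions:

```python
import random

def remove_random_nodes(adj, fraction, rng=random):
    """Randomly delete a fraction of nodes together with all their incident
    edges, emulating random failures; returns the surviving subgraph."""
    victims = set(rng.sample(sorted(adj), int(fraction * len(adj))))
    return {n: {v for v in neigh if v not in victims}
            for n, neigh in adj.items() if n not in victims}

# Complete graph on 10 nodes; removing 20% of nodes leaves 8 survivors
# with every edge toward a removed node also deleted.
full = {i: {j for j in range(10) if j != i} for i in range(10)}
survivors = remove_random_nodes(full, 0.2)
```

Sweeping the removal fraction and tracking the average path length of the survivors reproduces the shape of Figure 11, including the critical point at which the graph fragments.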

Conclusions
In this paper, we introduced the basic concepts of complex networks, including small-world networks and scale-free networks. Then, separate system models for small-world networks and scale-free networks were proposed that can be easily implemented and applied to IoT network optimization. The small-world network model contains five major blocks: the node initialization block, node connection block, rewiring block, average path length calculation block, and clustering coefficient calculation block. As for the scale-free network model, the major blocks are the node initialization block, new node generation block, target node selection block, average path length calculation block, and degree distribution calculation block. The proposed system models were evaluated based on various complex network metrics.
The simulation results confirm the small-world characteristics, showing highly clustered behavior and a small average path length based on the proposed system architecture. Furthermore, the degree distribution generated by the proposed system model followed the power-law distribution for all values of m₀ and m, proving that the generated network evolves into a scale-free network.

Competing Interests
The author declares that there are no competing interests regarding the publication of this paper.