Latency-Aware Power Management in Software-Defined Radios

Cloud computing provides benefits in terms of equipment consolidation and power savings from higher utilization for virtualizable software. Cellular communication software faces challenges in cloud computing platforms. BSs create a specific load profile that differs from traditional cloud service loads. Cellular communication system implementations have real-time deadlines with fixed, periodic latency requirements. In this paper, we assess the suitability of an unmodified Ubuntu Linux OS running on a commodity server to operate latency-critical software using a 4G LTE BS software-defined radio implementation. Scaling of the CPU clock frequency is shown to be feasible without excessive impact on the platform’s ability to meet the 4 ms processing delay requirement imposed by the LTE standard. Measurements show the relationship between the processor’s operating frequency and the number of missed subframe processing deadlines to be nonlinear. The results obtained also indicate that a high computational capacity does not suffice to ensure satisfactory operation since fronthaul processing overhead can limit achievable performance. Use of offload-capable network interface cards is studied as a potential remedy.


Introduction
Evolution of telecommunication systems is directed by the intersection of demand and available technical solutions. BS design evolution reflects how Moore's law has enabled increased platform flexibility. In early BSs, specialized functions were implemented by discrete elements. Over time, discrete solutions were replaced by specialized ASICs. More recently, the rigidity of ASICs was eschewed in favor of FPGA-and DSP-based platforms. Nowadays, BS processing is within the reach of general purpose processors (GPPs) [1][2][3].
Increased use of software has enabled greater consolidation of logical functions into physical devices through the use of network function virtualization (NFV), software-defined networking (SDN), and software-defined radio (SDR). Consequently, this trend has lead to investigation of the feasibility of replacing specialized hardware by commodity servers with general purpose processors (GPPs) and even PCs. In this paper, the suitability of cloud servers to provide cellular radio access processing, known as cloud radio access network (C-RAN), is assessed.
General purpose servers aim to provide reasonable throughput for a wide range of tasks and by definition are not optimized for any particular one of them. In order to serve the needs of cellular systems, general purpose computing platforms must be adapted to satisfactorily run latency-sensitive cellular communication applications while still retaining the flexibility inherent in a software-based design.
By porting radio access functionality to a cloud platform, cellular system development can be set onto the same trajectory as other cloud-based services. Virtualization enables consolidation of computing hardware, reduces system cost, provides power savings, and increases flexibility [4]. However, it comes at the expense of process isolation and reduced predictability. Very stringent processing delay requirements in cellular base stations (BSs) may make centralization impractical. In order to remedy this problem, the trend is to migrate the C-RAN computation towards the edge of the network into edge clouds [5,6].
For C-RAN to realise effective coexistence of multiple BSs, the implementation must cope with the unpredictable delays and variable processing power availability of general purpose server environments. Achieving this aim calls for the creation of SDR BS implementations and system tests to outline the limits of current server platforms in order to guide their evolution.
In addition to tight latency and timing requirements, energy efficiency has gained importance as a cellular system metric. Global energy expenditure of the ICT industry represents 2% of global consumption and is projected to reach 3% by 2020 [7]. Meeting both the ever growing and more varied needs for wireless communication, along with acceptable levels of energy consumption, requires solutions enabling power management to take into account the latency bounds inherent to real-time cellular systems. In particular, this study focuses on improving power management in a C-RAN type shared computational environment with latency constraints.
In this paper, we describe the specific computational requirements of a cellular BS and what constraints it imposes on a cloud platform. To serve this computational load takes not only software adaptation but also suitable hardware. e suitability of current commodity server platforms is investigated using a 4G cellular system BS running on an Intel server platform. Assessing the suitability of a general purpose Linux-based operating system (OS) constitutes a prerequisite to creating a suitable C-RAN to exploit the consolidation benefits of cloud technologies. Improvement opportunities are outlined based on the results obtained.
is article is organized into nine parts. e structure and functionality of a typical BS are presented in Section 2. Related work is surveyed in Section 3. Sections 4 and 5 describe the pertinent BS features and their implementation in the agile radio framework (ARF) SDR platform. In Sections 6, 7, and 8, the performance model, experimental setup, and the results obtained therewith are presented. Section 9 concludes this article.

Base Station Evolution into C-RAN
A cellular BS generates load that is characterized by high computational requirements along with stringent periodic completion deadlines. Because of these requirements, BSs are usually implemented in dedicated hardware with a realtime OS.
Each evolutionary stage provides more run-time configurability than the previous.
In a cloud environment, it is desirable to use general purpose hardware and virtual machines executing non-realtime OSs. Cloud-based SDRs would be implemented as an application process that can be created with libraries, toolsets, and frameworks employed in general software development. Implementation of BSs in centralized servers is called cloud radio access network (C-RAN). C-RAN systems are usually split up into remote radio head (RRH) and baseband unit (BBU). ese units are connected over a fronthaul link. e most efficient way to split functionality between BBU and RRH remains an open question. One of the issues is in which system component to place physical and MAC layer processing. is paper describes a centralized approach to C-RAN. RRHs receive modulated samples and perform baseband signal up and down conversion. e BBU then handles the computationally heavy baseband processing.
Softwarization of implementation combined with the physical decoupling of the BBU from the RRH introduces the possibility of virtualization and centralization. A single server may operate multiple BSs serving multiple cell sites. ese sites only require RRHs and a communication link to the centralized server. By eliminating computational resources from access points, a more efficient architecture can be realised, in which processing resources are allocated from the pool based on demand. Enabling demand-based resource provisioning, in turn, allows for many low-cost, lowenergy RRHs to be built to enhance coverage uniformity.
In the coverage area of a small cell, the number of active users can vary significantly. is, in turn, wastes energy with current peak-dimensioned designs due to a high probability of transmitters being idle [11]. Turning off idling base stations is one potential solution [12][13][14] but becomes problematic if one aims towards low-latency and high reliability communication. In such cases, when having a wake-up delay is unacceptable, energy reduction techniques constitute a preferable alternative.
Cellular networks exhibit spatial and temporal patterns in the load they experience [15,16]. Daytime load is higher than during the night, when many users are asleep. Furthermore, due to diurnal variations between areas, load varies spatially by C-RAN instance as well as users move from residential areas to workplaces and back.
While RTOSs exist to solve latency and jitter challenges, they cannot be used in a virtualized environment since they cannot offer any guarantees when running on top of a nonreal-time hypervisor controlling access to the actual hardware. Indirect access to hardware through the hypervisor further complicates management of timers and interrupts. Additionally, a real-time hypervisor's strict division of time between guests restricts opportunities for exploiting variable loads in guests.
Alternatively to RTOS, overprovisioning can be used to meet requirements.
is method wastes both capital and energy. e predicted increase in the number of cell sites in 5G only compounds this problem.
In light of the above, it becomes necessary to investigate the ability of virtualizable, general purpose OSs and hardware to meet the timing requirements of BSs. In a non-realtime OS, packets can be lost due to missing the packet preparation deadline, denoted as late packets. In this work, soft-real-time is defined as a task that must meet its deadline on average but can tolerate occasional overruns [17,18]. Conjointly, the relationship between energy efficiency and performance must also be investigated to determine design parameter trade-offs. Slowing down computing helps to run processors at more power efficient clock speeds.
By exploiting the diurnal variation, operators can save energy by reducing the processing power and thus energy consumption during off-peak hours and in lightly loaded cells. is could be realised, for example, by grouping the processing of geographically close cells in the same C-RAN server. e power-performance profile of each server (and therefore cell group) can then be adjusted according its specific needs. Further refinement of the configuration granularity becomes possible on hardware support per-core adjustment of frequency and voltage.
Determining parameters with highest impact allows for the design of cloud systems better suited to the needs of virtualized BSs. e BS can reduce the amount of late packets by adjusting the radio link throughput and thereby the computational load.
is opens up the possibility of leveraging a C-RAN platform to allocate computing resources intelligently. e following sections review research that has been conducted into latency-sensitive and energyefficiency conscious software.

Server Platforms for Time-Critical
Systems. Management of system power consumption through online variation of processor voltage and frequency is known as dynamic voltage and frequency scaling (DVFS). Work presented in [19] proposes a statistical scheme to estimate and adapt to variation in load for latency-critical tasks. e authors consider variable arrival and processing times. ey determine that queuing delays often dominate total latency. While similar in aim, the characteristics of the offered load in the present article are different. Due to the periodic nature of the LTE frame structure, arrival times are well determined and required completion times are known. Consequently, stale work items in queues can be discarded based on this known deadline. e BS also partially knows future load due to its control over resource allocations for connected devices it serves (see Section 2). e authors in [20] similarly focus on queuing. ey however propose extending the time a task spends in the queue in order to obtain longer periods during which the processor resides in a low power state. Such a solution can be problematic in cellular systems, where communicating nodes must uphold strict timing in order to preserve synchronization. DVFS use in an embedded computer has been investigated in [21]. e authors studied the impact of voltage scaling on the worst-case execution cycle counts of their test programs. Unlike the present article, none of their test cases involved external interrupts caused by fronthaul data arriving at the network interface card (NIC) or other peripherals. Problems involved in operating low-latency systems involving substantial network communication are discussed in [22]. e authors of [22] discuss the general characteristics and broad outlines of potential solutions but no concrete solutions are offered. e aforementioned work demonstrates that power savings are possible but the case of interrupt-heavy periodic loads has received little attention.

3.2.
Software-Based BS Implementations. Implementation of tightly latency constrained wireless communication systems on server hardware has been investigated before. In [1], the authors present their approach to running the LTE physical layer on Linux. In contrast to the approach used in this paper, they do not implement protocol layers above the physical. It is also unclear whether time-division duplexing (TDD) or frequency-division duplexing (FDD) is employed. In order to obtain satisfactory performance, the authors pin execution threads to specific CPU cores according to a handtuned pattern. In [23,24], communication with the radio frontend occurred over a Peripheral Component Interconnect Express (PCIe) bus instead of an Ethernet connection as in the present work. Furthermore, in the latter case, the authors used the real-time framework Xenomai to introduce timing guarantees to the OS. Work presented in [25] reports on the methodology used to implement TD-SCDMA. e authors modified the Linux kernel's scheduling interrupt timer's frequency.
is technique is no longer applicable to modern tickless versions of the kernel. Techniques applicable to real-time approaches, such as Xenomai, are more cumbersome since they attempt to enforce timing constraints on the entire BS software stack instead of only network packet processing. Interrupt patterns of IP-based traffic differ from those of PCIe. Studies into the performance of IP-based soft-real-time SDR are therefore needed.

Contribution.
e above-presented related works considered either the problem of implementing latency-constrained communication systems or the reduction of energy consumption in processing local tasks. is article aims to combine both aspects and provide a scheme for reducing power usage while maintaining processing delays acceptably low. Energy savings are accomplished through CPU clock frequency scaling. Measurements quantify the dependence between the operating frequency and the BS's ability to complete tasks in a timely fashion. Factors taken into account include processing in the BBU, the radio frontend, and the network communication between them. e present work explores the feasibility of implementing BSs on GPPs with a standard Linux OS kernel through tuning of hardware and OS parameters. e aim is to assess the suitability of a soft-real-time latency-critical cellular software to operate on cloud infrastructure. e techniques used do not require any modifications to the standard OS kernel or the hardware. is helps to ensure applicability in C-RAN execution environments.

Logical Functions in a Base Station
All BSs present the same functional interface to user equipment (UE) to allow interoperability. From the network's point-of-view however, they are not identical nor do they serve the same purpose. Some BSs, known as macro BSs, target a large coverage area called a macrocell. Complementing these are micro-and picocells serving demand in small hotspots. Large macrocells are suitable for serving fast moving users, since their large coverage area will result in less frequent handovers than if the user was served by more spatially limited access points.
While all the BSs are equipped with the same specification-mandated functions, they might be called at different rates and with different target parameters. As such, BSs can generate different computational loads. ese differences provide opportunities for intelligent resource adaptation.
BS radio interface supports two-way communication. e BS receives data from the core network and prepares it for transmission over the air interface and conversely it takes the signal from the radio interface and converts it to bits to be transmitted to the core network. In order to compensate for channel errors, signals have to be acknowledged positively or negatively. e specification-mandated period [26] within which this must be accomplished sets a constraint on computing speed for decoding received data and sending out the acknowledgments. e processing in BS can be split into physical layer processing and higher layer processing. e physical layer related processing is mainly execution of data flow type computationally heavy algorithms. Higher layers deal dominantly with users and protocol states. Figure 1 depicts the processing flow for data to and from the air interface. e BS's external connection to the mobile core network is beyond the scope of this work.
Functions dealing with digitized samples of the signal received at the antennas are collectively known as Layer 1 (L1). In 3GPP-specified cellular systems, L1 operates on groups of samples contiguous in time known as subframes or transmission time intervals (TTIs). In modern cellular systems, the TTI is the basic time unit for processing. e samples to be transmitted have to be prepared for each TTI and received samples have to be analyzed during certain TTIs.
e main functions in L1 are modulation, equalization, channel coding, and waveform processing. Processing load depends on the modulation and coding scheme (MCS) which the BS selects based on a mapping from the observed radio link quality. e MCS value selects operating parameters, such as constellation size and coding rate, for the modulator, demodulator, encoder, and decoder. In traditional systems, the choice of modulation accounts only for the quality of the channel between communicating nodes. In cloud-based systems, there exists an additional trade-off based on processing requirements. A higher MCS value allows for more bits to be transmitted but requires more processing for receiving those bits. e server can trade off lower data rates for reduced processing power requirements.
Above the physical layer reside the medium access control (MAC) functions. ese primarily handle resource allocation, scheduling, and channel multiplexing. e MAC also carries the responsibility of generating acknowledgments and negative acknowledgments as well as responding to those received from the UE. In the LTE standard, the available response generation time is never shorter than four milliseconds. Resource allocation relies on these acknowledgments and on channel quality reports to inform the BS as to which transmission parameters are appropriate for the conditions experienced by each UE. Since the computational load is a function of user resource allocation, the scheduler can directly impact the BS's processing burden.
Connected to the core network are the BS functions dealing with IP packets. ese provide header compression, encryption, integrity protection, and segmentation. Encryption and integrity protection provide security. Segmentation and concatenation functions are necessary to enable IP packets to fit the available radio resources. Header compression helps in this regard by eliminating needlessly redundant information from transmitted packets.

BS Implementation Using the ARF Platform
e agile radio framework (ARF) is a soft-real-time SDR platform designed to run on commodity computer hardware using a GNU/Linux OS as shown in Figure 2. In this work, it is used to implement an LTE BS in order to evaluate C-RAN latency and computation load issues. e ARF is designed to be compatible with a wide variety of different server platforms. As such, the ARF does not require special drivers, beyond those for the radio frontend in use, or modifications to the kernel of the OS. is entails that no hard-real-time extensions need be applied. While RTOSs provide tools to solve latency and jitter-related issues, they cannot do so when operating virtualized on top of a hypervisor. Additionally, compatibility of the ARF with standard software tools and libraries leads to wide portability. e same applies to online migration of virtual machines. Apart from the radio frontend, the ARF does not need any special hardware. However, use of such is not precluded by the design. For instance, one could use a graphics processing unit (GPU) to accelerate computations [27].
Architecturally, the ARF platform is divided into three main components as depicted by Figure 3. e lowest layer is the Virtual Hardware Enhancement Layer (VHEL) [28]. Its purpose is to serve as an abstraction layer for hardware nonidealities. Radio frontend hardware interfacing is done through the Universal Software Radio Peripheral Hardware Driver (UHD) [29]. Since the ARF does not assume hardreal-time guarantees from the OS, protocol processing may be late in the upper or lower layers. As explained in Section 2, cellular systems require transmitted and received samples to be ready for a specific TTI. In a non-real-time system, the processing could exceed its allotted time or be delayed past its deadline. In this paper, such events are termed lates. Upon occurrence of a late, the VHEL sends a preconfigured contingency subframe to the radio frontend to help maintain timing alignment. is could, for example, be a zero-filled subframe or a pregenerated subframe containing only pilots and synchronization signals if appropriate. In the receive direction, the VHEL will similarly mask lates and overflows by handing the protocol's physical layer a subframe containing the correct number of samples even if some were lost. e VHEL hides lates and therefore the soft-real-time nature of the platform from upper layers. ese are written as if the samples would always arrive on time. Upper layers can therefore apply block-based processing while ignoring timing related issues. e communication protocol's retransmission mechanisms can take care of performing a new transmission of data lost due to lateness of processing in the transmitter part 4 Journal of Electrical and Computer Engineering of the ARF platform. From the communication protocol's point-of-view, this is no different than experiencing poor channel conditions, provided late processing occurs seldom enough. While this approach does not provide the same level of timing determinism as RTOS-based designs, it simplifies development. Code related to handling TTI timing and meeting deadlines resides only in the VHEL. Other code may be written as non-latency-critical software. In case performance proves insufficient, traditional software optimization techniques can be applied until deadlines can be met with some probability. Furthermore, an intelligent cloud platform could take advantage of the difference in processing load between different BSs as described in Section 2 to adapt its resources optimally.

C-RAN Processing Performance Model
e main challenge in C-RAN implementation is reduction of late packets. Whether a packet is late depends on the total packet processing time. is time is a function of the packet processing time within the BS stack T BS , delays related to operating system functionality T OS , and delays related to data transmission over the fronthaul link T f . Knowledge of the latter is required in order to estimate how far ahead of the deadline a packet must be sent to arrive in time at the frontend.
Accurate total processing time information might not be available to the C-RAN. In practical systems, it is easier to measure the number of late packets as a function of these various delays F P (T BS , T OS , T f ). e function F P provides a tool for adapting BS processing. To achieve a given packet outage target P out , the C-RAN should manage delays such that While allocating its resources, the C-RAN can track F P in real time. Optimal allocation raises the question of whether F P captures the system's characteristics sufficiently well and which parameters are the most significant for system performance. Function F P is implementation dependent. However, system-specific dependencies can be measured and learned for each particular platform at run time. Journal of Electrical and Computer Engineering e ARF platform is used as a measurement tool for identifying cloud processing bottlenecks while executing an SDR BS load. Execution time is subdivided into two parts: cellular BS-related functionality and operating system housekeeping activities. e former constitutes the lion's share of the load. In turn, it is composed to two components. e first is air interface processing, comprised of relatively computationally heavy physical layer algorithms executed each in TTI (1 ms interval in LTE). Higher layer communication protocol processing executes separately from the physical layer and presents a comparatively much lighter processing burden. e second component includes all noncore-functionality processing, termed management processing, such as data copying, OS function calls, and fronthaul data transmission. Contention for these resource as well as OS kernel scheduling decision can influence whether processing completes on time. Overall C-RAN behavior is characterized by how the system management and BS processing work together.
is was investigated through the insertion of measurement points into the BS and management parts of the ARF code.
e results presented in this article were obtained by running a partial LTE Release 8 payload in TDD mode on the ARF. Consequently, the TTI duration is 1 ms. e measurements were done while sending data over the air. Performance of the TX processing impacts the RX processing and vice versa. Processing-time overruns in one will reduce the CPU time available for the other. is stems from the need to know whether to send a positive or negative acknowledgment and, on the transmit side, whether the retransmission buffer can be cleared after a successful reception by the remote node. UEs must first receive grants from the BS before they know when and with what parameters they may transmit. is limits the available time to prepare data to the time between receipt of the grant and the transmission time.
Packet processing time (including transmission to RRH) should be less than the TTI duration. Each component of the packet late function F P contributes to overall processing time differently. e delay T BS , related to user load, is a linear function of the computational demands on the CPU. e more data the UEs want to receive, the more work the BS must perform to transmit it to them. In order to fit a greater quantity of data into the same, fixed, amount of spectrum, more complex modulation, equalization, and channel coding schemes must be employed. Doing so increases computational complexity. Increased load can be managed by adding more processing capacity. is could, for example, be more or faster CPUs as well as accelerators.
Fronthaul transmission-related delays are composed of data transmission and endpoint processing. For dedicated fronthaul links, the data transmission delay T f is largely fixed and adds latency without any means for the BBU to compensate. Endpoint processing, on the other hand, can be sped up in the BBU. Typically, it is the OS that handles network-related tasks. It is therefore possible to obtain improved performance by switching to more capable network hardware, a newer kernel, or different OS. e main contributor to OS-related delays T OS is interrupt processing. It causes delay and jitter due to context switches to interrupt service routines and handlers [35]. Context switches degrade performance even if they are not computationally expensive as they might cause data to be evicted from cache, new processor state information to be loaded, and virtual memory translation lookaside buffers to be flushed [36]. Since context switches can be caused by aperiodic background processes and hardware interrupts, their effect on an SDR platform-or any other software-is not easily predicted or quantified by said software itself. Mitigation strategies must therefore extend beyond the SDR code base into the operating system, hardware, and load management.

Experimental Setup
Measurements to quantify the contribution of each factor of F P (equation (1)) were carried out using two different systems. ese measurements serve to quantify the impact of general purpose hardware and operating systems on cellular BS implementation performance. Table 1 presents the test systems' main components while Table 2 lists the Linux kernel boot arguments. Both systems were directly connected to their respective frontends using Gigabit Ethernet. e ARF SDR platform was configured as an LTE Release 8 BS with one transmitter and one receiver antenna. Assessing the impact of protocol processing was done through the use of the data plane. Turbo Code decoding was used to create a uniform and constant load. Five iterations of the decoder were applied to 25 resource blocks (RB) in the LTE subframe structure. Modulation and coding scheme (MCS) 9 was used as defined in LTE Release 8 for uplink data transmissions.
e operating system used was Ubuntu 16.04 LTS. In an effort to reduce interrupt and CPU contention for the BS code, OS boot-time parameters were modified from their defaults. e aim of the changes is to improve the performance of the ARF platform by dedicating resources. Listing 1 gives the parameters applied to the first system's Linux kernel on boot and Listing 2 presents the same for the second system. e "intel_pstate � disable" parameter was used to prevent dynamic control of the operating frequency in order to enable consistent measurement runs. e OS's scheduler was instructed to avoid CPU cores 2-7 and 10-15. Doing so reserves them for use by the BS code. e parameters in Listing 1 and Listing 2 were in addition to the "root�" parameter as well as the default values of Ubuntu 16.04 for each system, given in Listings 3 and 4, respectively. No additional kernel boot settings were applied to System 3.
In addition to boot-time parameters, measures were taken after system start-up to further improve performance on System 1 and System 2. e governor for all processor cores was set to the "performance" setting. e energyperformance bias parameter was also changed. It was set to indicate to the CPU to prefer higher performance over lower energy consumption. is was done in an effort to minimize latency and increase the repeatability of the data since the decisions of the CPU are opaque to the user. e SDR platform process was also executed with a high priority. is 6 Journal of Electrical and Computer Engineering helps to ensure that the OS's scheduler tries to minimize interruptions caused by other tasks requesting processor time. Ensuring constant operating parameters reduces jitter from transitions between them and helps avoid unnecessary delays. Such transitions were observed to occur frequently. One possible reason for this is that relatively low CPU utilization rate led the OS to reduce clock frequency in order to save power as it believed the system to be relatively unloaded. In fact, the system was likely busy moving data or waiting for it. Waking up from a low power state when the data does arrive takes more time than resuming processing from a fully on state. e cset application [37] was used to set the ARF code to run on the isolated CPUs. On System 1, two cores and their adjoining Hyperthread cores were kept for other processes while on System 2, the corresponding number was one. On System 3, the CPU governor was set to "performance" and the interrupt coalescing value to 10 μs.
Although the measurements presented in this work were obtained on a physical host without virtualization, we argue that the results provide meaningful insights into C-RAN implementation for three reasons. Firstly, a C-RAN differs from a public cloud service in that the platform is purposebuilt to host cellular BBUs under the control of the same administrative entity, i.e., the network operator. As such, fewer abstraction layers are required than in a public cloud. For the same reasons, direct access to hardware resources can be granted to guests. Furthermore, hardware-assisted virtualization technology, such as single root input/output virtualization (SR-IOV), enables a single device to be shared amongst the guests at a hardware level with low overhead [38]. Secondly, being a purpose-build platform also makes it feasible to provide an interface (i.e., paravirtualization) for the guests to communicate their load status to the hypervisor. e latter can then utilize the similarity of the diurnal cycle induced load (see Section 2) across neighboring BBUs to scale down the CPU frequency of the whole system.
irdly, an operator's business model differs from a public cloud provider. e latter might oversubscribe resources to increase revenue [39]. Another difference is that public clouds do not know a priori what load type their customers will run. Infrastructure must therefore be designed to be generic. A C-RAN platform, on the other hand, knows the load type to be executed and places meeting latency requirements first. Consequently, resource allocations can be tailored to the task at hand. Furthermore, since payloads in a C-RAN platform originate from trusted users, less need for hardware abstractions and security isolation overhead exists. Performance, therefore, behaves more akin to a physical host, yet still provides the benefits of consolidation by enabling multiple BSs to be hosted on a single server.
Measurements were performed to determine the impact of two different system parameters on the performance of the ARF. e first one was the clock frequency of the CPU. It was varied from highest to lowest supported by the CPU (see Table 3). CPU clock frequency measurements were done to study the feasibility of reducing power consumption without compromising performance. A lower clock frequency enables operating voltage to be dropped, which in turn leads to power savings. e CPU's operating speed impacts the T BS parameter of F P . e second type of measurement involved varying the interrupt coalescing behavior of the NIC (see Section 3.3). Table 3 shows the parameters used for interrupt coalescing. e value labeled "rx-usecs" indicates the amount of time in microseconds that the NIC was configured to wait for further network packets to arrive before notifying the CPU in an effort to improve the interrupt-to-payload ration. Interrupt coalescing was selected as test parameter to assess the impact of OS processing (T OS in F P ) on BS performance. A greater number of NIC interrupts result in more processing time spent in kernel space. is does not significantly change the computational load as each network packet still needs to be processed but does reduce the number of context switches required.

Results and Analysis
Results from the measurements are reported as the probability of the subframe processing being late. It is computed by normalizing the number of late subframes by the total number of subframes. is metric was chosen instead of the more conventional CPU usage as it better reflects the objective: to process received subframes and prepare new ones to transmit within the alloted time. It would indeed be possible for the processor to be relatively unoccupied but to have data be delayed by other factors (for example, network stack processing or interrupt moderation by NICs). CPU clock frequency and interrupt coalescing measurements  were carried out using System 1 and System 2 while the NIC offload measurements were performed using System 3. Figures 4 and 5, performance can be seen improving from the lighter processing load to the more demanding one. Since the ARF platform can handle higher data rate (load) better, improved performance at a higher load suggests computational capacity is not the sole factor influencing performance. To confirm that delays outside of the SDR platform impact performance, measurements were carried out to determine the effect of additional system delay in processing incoming network traffic. e NIC's interrupt coalescing timeout was varied to introduce extra delay. Despite load, bandwidth, and CPU clock frequency remaining constant, Figure 8 shows an increase in the ratio of dropped TTIs.     Journal of Electrical and Computer Engineering

Performance per MHz. Comparing
Even though a subframe may be ready in time in the BBU, it still needs to be transported to the RRH. When setting the frequency of the CPU, frontend transmission delays must also be taken into account. For receiver-side data processing, the same principle applies in reverse; not all of the TTI duration is available for baseband processing. is reduction in available processing time must be taken into account in the design of the SDR platform and the configuration of the OS it runs on. Unlike RTOS running on ASIC-based designs, a margin should also be included to protect against jitter in network processing caused by competing processes running on the same machine. e measurement with heavy protocol processing load leads to the situation depicted in Figure 6. e shape of the curve remains the same but the ratio of dropped TTIs below the threshold frequency rises. Performance above this threshold point remains essentially unchanged. Operating the CPU at full frequency therefore needlessly increases energy expenditure without providing a corresponding improvement in the timeliness of computations or the performance of the platform in general. It is already operating at effectively the same performance as a dedicated design with preset timing and scheduling.
A potential alternative would be to execute the job as quickly as possible using the highest performance setting and then enter a low-power state. Such an approach may not be suitable in all cases. Resuming normal execution from a low-power state incurs a delay and consumes full power [40]. erefore, in order to be beneficial, the length of the idle period must be long enough that the energy and performance penalty [41] of transition into a power saving state is recouped. e latency is of particular concern in cellular systems. Processing a subframe requires obtaining all its constituent samples from the radio frontend. ese do not arrive at the same time in one transaction but are instead sent in chunks, the size of which is determined by the Ethernet maximum transmission unit (MTU) size. Frames are sent as soon as they are filled with samples. is means the C-RAN platform must continuously process incoming data from the network and thus requiring CPU involvement. Delaying processing of arriving data means accumulating work to be done once the last sample is received. is backlog of work then creates extra delay in the handling of the subframe as pending data frames must be handled first. Similarly, if TTIs get shorter or subframes are processed in smaller parts to reduce latency, the intervals available for power states decrease even further. However, the frequency-scaling method presented in this work is not mutually exclusive with other power management techniques but rather a complementary approach. For example, it may be possible to hand over users from several lightly loaded BBUs to one. en, the all but one could be put into power saving mode while the active one executes with the lowest power state providing the required performance. How to combine these techniques as part of an overall solution is left up to future work. Figure 7 indicates that the existence of the energy-saving frequency threshold is not specific to System 1. System 2 exhibits the same type of behavior but with a greater potential energy saving due to a higher maximum CPU operating frequency.    Journal of Electrical and Computer Engineering

Performance
Modeling. C-RAN server run-time adaptation requires learning the BS's performance vis-à-vis its load and to be able to predict the impact of changes in operating parameters. One way to accomplish this is to extract a dependency function, the parameters of which can be learned in real-time. Such a function was sought using collected performance data. Analysis of the data (see Figures 4-7) suggested that the impact of CPU frequency can be divided into two distinct regimes: one in which it majorly affects the late rate and one in which it does not. Due to this shape, two function families were selected: (2) e criteria for the selection of the equations' form were being a function of CPU frequency and sampling rate, possessing two regions (steep and shallow), and having a suitably rapid transition from one region to the other. Parameterwise, CPU frequency was selected since it is user tunable and significantly affects both performance and energy consumption. Sampling rate similarly heavily impacts the amount of data needed to be processed. rough preliminary analysis of early measurements, these two factors were picked as being the most significant.
Studying the effect of sampling rate provides one of the defining characteristics of C-RAN, namely, a steady and constant inflow of samples to process. e analysis could, however, be made more general by considering a generalized load factor. Such a load factor would be constituted of the traffic-independent part (i.e., sampling rate) and the trafficdependent part (i.e., per-resource block processing). e latter would be subject to the diurnal patterns discussed in Section 2. e analysis of such a generalized model is left up to future work.
While the general form of the function could be determined by inspection of the collected data plots, parameter combinations are too numerous for manual determination to be practicable. Instead, an exhaustive search through all the parameter combinations determines the best selection, as well as their coefficients as presented in [42]. Functions g 1 , g 2 , and g 3 (equations (4)-(6)) contain these coefficients: Fitting of the model was done using the least squares method in MATLAB [43]. Mean-squared error (MSE) was used as the goodness-of-fit metric. e least squares method exhibits sensitivity to its starting point. Multiple initial values were therefore employed. Initial values were obtained as two-step process. First, random values were obtained by multiplying a random number uniformly distributed in [0, 1] with a multiplier. Table 4 lists the multipliers used. Each parameter of the model received an independently selected value. An initial fitting was then conducted. A subset of the lowest MSE coefficients was retained to serve as initial values for subsequent fittings.
Iterations of the search potentially yield unsuitable results. is may occur due to ill-conditioned Jacobians or rank deficiency. Any set of coefficients producing issues of the aforementioned type, as reported by MATLAB, was rejected. Additionally, significance of each parameter to the overall model formed an additional criterion. Parameters more likely than α � 0.05 to result from the null hypothesis resulted in rejection of the generated model.
5526e + 04 -1.0976e + 13 9.3501e + 07 9.9289e + 05 -    e corresponding values for f 1 are given in Tables 7 and 8. Similar results for System 2 are omitted for brevity. ey followed a similar pattern to the results for System 1.
After an initial sharp drop in MSE between one and six parameters, performance remains mostly stagnant from seven parameters upwards. For f 0 , the lowest MSE occurs for 14 parameters and for f 1 , at 13. Lower parameter count models are preferable as they are less likely to overfit, which degrades predictive accuracy, and since they are less costly to fit parameters for processingwise. erefore, we select f 0 at seven parameters as the most suitable model (its MSE being lower than f 1 at the same number of parameters). Equation  Figure 9 illustrates the selected model's fit for System 1 data. To assess the impact of specific system characteristics on performance, Figure 10 presents the fit f m with the data recorded for System 2. It can be observed that while the general shape of the curve remains similar, there is more divergence between modelled and recorded than in Figure 9. Residual plots (Figures 12 and 13) for the selected model indicate a different pattern on both systems. e quality of the fit improves when the model is retrained based on System 2 data as shown in Figure 11. is indicates that the general form of the function applies to multiple platforms. Machine-specific variations are expected since hardware, OS, driver, and load differences impact behavior. Completely accounting for all such variations in practice presents an almost infeasible task.
Determining when diminishing returns set in as a function of f m accomplished by analysing the relationship between the dropped TTI ratio and the CPU clock rate. Beyond the threshold frequency, performance stops increasing. A practical SDR platform implementation may track this information and use it to build time averaged statistics. ese in turn can be utilized in online computations of the predictive equation's parameter values reflecting the current state of the execution environment. Shorter averaging times lead to a more responsive estimation of the required CPU frequency while longer ones reject transients better.
In addition to the statistics collection above, SDR systems may also utilize communication protocol scheduler data for performance-energy optimizations. Since in an LTE system UEs may only transmit when given a grant by the BS, the latter knows to a large degree its receiver-side processing load some period of time into the future. It is not possible to account for all incoming traffic (e.g., random access bursts), so a margin should be left to avoid underestimating the required computational resources. No fluctuation margin was included in the values computed in this section as that is a system design issue and therefore dependent on the use-case.

Network Processing Impact.
Cloud infrastructure exhibits not only a non-real-time computational  -       environment but also non-real-time data transfers. Samples must be moved to and from the radio frontends to the virtualized BBU. Network traffic interrupts were observed to have a significant impact on platform performance. Network traffic originated interrupts can disrupt the dataflow processing of the physical layer. is will not be visible in the CPU usage of the baseband task but can still cause processing to be late. CPU clock frequency      adjustment algorithms may therefore make incorrect predictions. A smaller standard deviation provides a more predictable environment for the SDR platform to adapt to, thus making it possible to scale CPU frequency down more aggressively.
Measurements were carried out to quantify the proportion of time spent handling network related functions. In an effort to isolate variance to that caused by network processing itself, no LTE processing was executed concurrently. Tables 9 and 10 show round-trip time (RTT) obtained for 50 megasamples per second with 368 samples per TTI at two different MTU sizes.
One method to relieve the CPU is the use of offloadcapable NICs. ese take on the task of processing incoming network packets instead of the OS. As such, they only deliver ready-to-use payload data straight to the application as there is no need to invoke any OS kernel functionality. Offloading network processing presents gains orthogonal and independent of baseband processing capacity.
In both the small and large MTU cases, offloaded network processing provided the best performance. e much lower standard deviation improves reliability of the BS software when the CPU clock frequency is adjusted close to the minimum required as described in Section 8.1. ese latencies measured represent a lower bound on TTI preparation time that is independent of computing capacity. It should also be noted that the offloading results in less CPU usage and context switches.
While LTE requirements can be met, the impact of network processing on SDR performance becomes more severe as TTI durations decrease. In 5G, TTI duration is expected to shrink to provide lower latency communication.
e ratio of TTI duration to interrupt handling latency will worsen the risk becoming the limiting factor for 5G SDR implementations. Additionally, larger bandwidth will require more samples to be transferred between RRHs and BBU. e resultant increase in network traffic increases the number of packets to process and thus network processing load on the OS.

Conclusion
Cellular BS execution on general purpose hardware and OSs is feasible. e SDR aim of increasing the flexibility of communication protocol implementations can be met if system parameters are appropriately taken into account to achieve optimal performance. It was demonstrated that acceptable performance can be obtained without resorting to a hard-real-time OS or dedicated ASIC-based designs. is opens up the possibility of consolidating BS processing into C-RAN data centers. is could further increase the flexibility of SDR system deployment by enabling on-demand allocation of resources to those cells experiencing load while activating power saving measures on unloaded ones. Concentration of computations produces savings in the hardware investments required to build the network. In addition, energy savings can be realised through CPU frequency scaling. Measurements showed that the relationship between operating frequency and the probability of late subframe processing is nonlinear. By operating at, or slight above, the threshold point, performance can be kept at virtually the same level as fully on but with a significantly lower energy consumption.   It was shown that optimising for computations-persecond is not sufficient-latencies and jitter caused by competing processes must be taken into account. On the other hand, energy-efficient use of computing resources demands that CPU clock frequencies be kept low and idle periods extended as much possible. In order to combine these two objectives-few missed TTI deadlines and high power savings-parameter tuning should take into account not only the processing needs of the application itself but also delays external to the platform. In particular, latency from passing data from the NIC through the network stack to the user process was studied. It was found that network traffic generated interrupts impact processing time and must therefore also be taken into consideration. Especially for shorter TTI durations, network processing delay can constitute a substantial portion of the total computation time budget. Accounting for this is of particular relevance in C-RAN environments, where RRHs are physically separate from the servers executing the baseband processing. Mitigating the impact of moving data to-and-from BBU constitutes an important avenue of research to improve the performance of virtualized cellular software platforms. NICs offering full network stack offload constitute one possible solution.
Several metrics along with a performance model were also proposed to help in estimating suitable operating points for the CPU clock frequency. Such a model may be used to determine appropriate resource allocation levels for software-based BSs according to load. Design of a self-tuning platform is left up to future work.

Data Availability
e latency measurement data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.