Study of Energy-Efficient Optimization Techniques for High-Level Homogeneous Resource Management

Resource management efficiency can be a beneficial step toward optimizing power consumption in software-hardware integrated systems. Languages such as C, C++, and Fortran have been extremely popular for dealing with optimization and memory management.


Introduction
For many software applications, dynamic memory management (DMM) has become prohibitively expensive. According to studies, C programs can spend up to 30% of their running time on memory allocation and release. Object-oriented programming (OOP) frequently results in additional allocation and deallocation work, and according to the data, C++ programs allocate dynamic memory more heavily than comparable C programs. The causes, however, are unknown: to date, no allocation patterns for C++ applications have been reported. This emphasizes the importance of quantitative analysis of allocation patterns in order to achieve the best possible system structure. This paper introduces a novel approach to investigating memory allocation at the source code level. We begin by classifying all of the conditions that may necessitate the use of dynamic memory management (DMM) in C++. These memory allocation patterns, according to our theory, are linked either to the C++ language itself or to the program. DMM requests issued by constructors, copy constructors, or overloaded operators, for example, are inherent to OOP in C++, whereas calls to the new or delete operator made from a program's member functions relate directly to the program. A novice C++ programmer, on the other hand, can easily write a C++ program without using the object-oriented paradigm at all.
In recent decades, new allocation strategies have been proposed that focus on parallel allocation [1], driven by the spread of multilayered architectures and the use of multithreaded applications. Following the introduction of 64-bit programs and the widespread acceptance of large-memory applications, the fragmentation problem, which had previously received little attention in allocator design, has emerged as a major issue that degrades both space efficiency and performance.
Current memory allocators, in particular, focus on fast memory allocation and deallocation, and they generally use the same strategy of organizing virtual memory into multiple size-class bins. The Hoard memory allocator [2], for example, has 32-, 64-, 128-, 256-, and 512-byte bins. If we want to allocate 47 bytes, the allocator serves the request from the 64-byte bin and hands us 64 bytes. This method of memory allocation is clearly fast, but 64 − 47 = 17 bytes are wasted.
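To make the binning arithmetic concrete, the following C++ sketch rounds a request up to illustrative size classes like the ones just described. This is a simplified model for exposition only, not Hoard's actual bin layout or code.

#include <cstddef>
#include <cstdio>

// Illustrative size classes; a simplified model, not Hoard's real layout.
static const std::size_t kBins[] = {32, 64, 128, 256, 512};

// Round a request up to the smallest bin that can hold it.
std::size_t binFor(std::size_t request) {
    for (std::size_t bin : kBins)
        if (request <= bin) return bin;
    return request; // requests above 512 bytes bypass the bins here
}

int main() {
    std::size_t request = 47;
    std::size_t granted = binFor(request);
    std::printf("request=%zu granted=%zu wasted=%zu\n",
                request, granted, granted - request); // prints 47, 64, 17
}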
With applications that allocate little memory, this design has worked well in the past. However, for a resource-intensive application that allocates in the same way, the waste is enormous. This massive waste not only hurts space efficiency but also causes virtual memory to balloon, resulting in more TLB misses [3], which severely impede performance. In response to this problem, this paper proposes a new heap/memory allocator design that focuses on fragmentation reduction. We concentrate on large memory allocations and give them exactly the size they request in order to reduce TLB misses [3,4] and improve performance. Experiments show that, compared to Hoard, our new memory allocator design can achieve up to 1.3x performance (28.8% on average) with lower memory usage (18% less) on large-memory-footprint benchmarks that allocate many objects, indicating great potential for widespread adoption. Our memory allocator is a general-purpose allocator that can be used in a variety of applications, including logic control programs and scientific computing.

Literature Survey
DMM has proven to be a costly component in the majority of programming languages. Memory allocation and deallocation define the overall efficiency of many software systems, as described by Michael Neely [5]. He and Zorn [6] demonstrated that memory-intensive programs consume up to 40 percent of their runtime allocating and freeing memory. The memory allocator has a significant impact on program efficiency in terms of both performance and memory space [7]. According to related research, managing dynamic objects is essentially a matter of allocating and deallocating them.
Maas demonstrated a novel approach to memory fragmentation and object lifetime management during program execution, reducing memory fragmentation on several production servers by up to 78 percent using only huge pages. Allamanis [8] demonstrated that as the number of objects grows, so do execution time and memory fragmentation; plotting the results yielded a logarithmic curve showing a direct relationship between object count and execution time. Existing C/C++ memory allocators use a number of strategies to reduce average fragmentation in C/C++ programs [9]. Several methods for solving this problem have been evaluated, and only some of them were found to be useful [10]; these methods turned out to be inherently limited and inapplicable in all situations. Robson demonstrated in various experiments that allocators can suffer from large memory fragmentation, which can hurt the program's overall efficiency and even result in a crash or failure [11]. Cohen and Petrank use partial compaction to prove upper and lower bounds on defragmentation [12,13].
To reduce fragmentation, TCMalloc, a well-tuned allocator [14], was used to measure and compare execution time against object size.
Its heap profiling mechanism does a good job of identifying long-lived objects by generating, at the end of the application's execution, a list of sampled objects, the majority of which are long-lived, together with their allocation sites. In the work of [15], an open-source profiling and analysis tool, made accessible by installing an HTTP handler, was used to compare the results. This made it possible to compare how many of these allocations were allocated and deallocated on the same CPU or thread; the result is saved into a protocol buffer at the end of a sampling period. A Bayesian approach [16] simulated various scenarios and produced a model that predicts object lifetime during program execution using various strategies and optimization techniques. Languages like LISP and Java have had garbage collection for a long time [17][18][19]. Compaction is implemented as part of the garbage collection algorithms in modern runtimes such as the HotSpot JVM, the .NET VM, and the SpiderMonkey JavaScript VM [20]. The fact that no single GC provides the best results for all programs motivates these efforts. In terms of approach, however, this line of work gives the developer no control and prevents the mixing of different GC designs within the same program. Shoaib et al. [21] described a concept called the Write Rationing GC for big data processing, which moves objects with a large/small number of writes into DRAM/NVM to extend the lifetime of the NVM. NVM for managed programs is supported by approaches like Espresso [22]. Nowadays, C++ programs, as opposed to C programs, make extensive use of dynamic memory for short-term allocations, which often results in faster access to objects and thus increases program efficiency [23]. In comparison to C programs, studies have shown that C++ programs create dynamic objects much faster and with fewer errors [24]. The challenges that modern memory management systems face are exemplified by Memcached, which is widely used in modern web architectures for caching temporary data [25]. Facebook and Twitter, for example, make extensive use of the technology to reduce database server load and rely on a 99 percent hit rate to scale to their massive user bases [26].
Automatic cached memory cleanup in mobile apps, as described by Umar Farooq [27], can greatly reduce program complexity and aid in the smooth running of apps. Simply using multiple CPUs to increase a system's computation power is usually ineffective, as memory management becomes a bottleneck [28]. C and C++ programs rely heavily on dynamic memory allocation, and the related key functions (malloc() and free() for C, the new and delete operators for C++) have long been required parts of the standard libraries. Researchers are now working on a standard dynamic memory allocation technique/algorithm that is more efficient in terms of speed, performance, and memory than previous ones. There exist many applications of efficient resource management, including the testing of object-oriented software [29], multimedia optimizations [30,31], and mathematical optimizations [32]. Memory and energy efficiency have become prominent criteria in many scenarios, such as system on chip (SoC) [33], edge computing, federated machine learning, cluster computing, and the Internet of Everything (IoE). Sundari et al. [33] discuss energy-efficient SoC memory management techniques. Some recent energy-efficient static and dynamic memory management techniques can be found in the works given in [34][35][36].

Design Goals
Performance: the first objective is to create a memory manager that outperforms the default memory managers included with the language. Concurrent memory allocations and deallocations must not cause any performance degradation.
Novelty: the memory manager should be able to handle repeated allocation patterns in the code and optimize their performance accordingly.
Platform independent: the memory manager design should be independent of any particular system and should be portable across platforms without relying on platform-specific dependencies.
Ease of use: when users incorporate a memory manager into their code, they should only need to change a small amount of code.
Robustness: the memory manager must not leave any traces after its use has ended and must release all of the memory it requested before the program terminates. This prevents memory from being leaked. The memory manager should also handle all error cases.

Strategies Used in Design
Request memory in large chunks: one of the most popular memory management techniques is to request memory in large chunks during program startup and then intermittently during execution. Memory allocation requests for specific data structures are then served from these chunks. As a result, fewer system calls are made and running time is reduced.
Allocation pattern optimization: in any system, certain request sizes are more prevalent than others for specific applications. The memory manager will perform well if it is optimized to handle these requests better.
Memory deallocation to optimize operating system calls: during execution, freed memory should be collected into containers, and additional memory requests should be serviced from these containers first. If that fails, the request should be served from one of the large chunks allocated during startup. While memory management is primarily intended to improve system performance and prevent memory leaks, this approach can also shrink the program's memory footprint through the reuse of freed memory.

Implementations and Performance Analysis
The default new and delete operators in C++ for allocating and deallocating memory have some limitations, which we can overcome by writing an optimized and efficient memory management algorithm that makes use of the concepts of computing, caching, memory, and data structures discussed so far. Operator new executes in a nondeterministic manner: when we call new, the operating system may or may not allocate a new physical page to the process, which can be quite slow if we do so frequently. When new is called, the system looks for a memory block large enough to hold our request. This also raises the issue of memory fragmentation, which we discuss next.
For example, if we allocate 10 KB from the middle of a 20 MB chunk, we can no longer allocate the remaining memory as one contiguous block. If we allocate a memory region but do not free it, we have a memory leak; if memory allocation operations continue without bound, the system's memory will be rapidly depleted and the system will crash. The new and delete operators also consume a significant amount of time for allocation and deallocation, which at a higher level can noticeably reduce the speed of a C++ application.
For instance, suppose we are given the task of creating 1000 objects and are required to create and destroy them 500000 times in a given context. This equates to 500000 × 1000 × 2 operations (allocations plus deallocations), i.e., one billion operations. If we use the default new and delete operators for this purpose, benchmarking this example results in a processing time of 30.469 seconds on a particular computer.
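A minimal sketch of this baseline benchmark is shown below, timed with the chrono library. MyPracticalClass here is a hypothetical stand-in, since the paper does not list the actual test class.

#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the paper's test class.
struct MyPracticalClass {
    int a, b, c;
    void initialize(int x) { a = b = c = x; }
};

int main() {
    const int kCycles = 500000, kObjects = 1000;
    static MyPracticalClass* ptrs[kObjects];

    auto start = std::chrono::steady_clock::now();
    for (int cycle = 0; cycle < kCycles; ++cycle) {
        for (int i = 0; i < kObjects; ++i) {      // 1000 allocations
            ptrs[i] = new MyPracticalClass;
            ptrs[i]->initialize(i);
        }
        for (int i = 0; i < kObjects; ++i)        // 1000 deallocations
            delete ptrs[i];
    }
    std::chrono::duration<double> sec =
        std::chrono::steady_clock::now() - start;
    std::printf("default new/delete: %.3f s\n", sec.count());
}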

We discovered that for this particular problem of allocating and disposing of approximately 1000 objects per cycle, we can effectively reuse more than 70 percent of the objects. This can be accomplished by reusing memory and employing compact, contiguous data structures. Our approach is to create a memory manager object using templates to determine the class type for which we will create objects, and to pass it the number of objects to create in a single cycle (in our example, 1000). We will use a simple user-defined class that the memory manager will allocate and deallocate. We now discuss Algorithm 1, the benchmarking routine that the following memory managers will share.
(1) MemoryManager::nxtAddress() returns the address of a block of memory. The block size is equal to the class size (obtained with the sizeof operator), here sizeof(MyPracticalClass)
(2) MemoryManager::freeAddress() destructs the object at the memory pointed to by the given pointer; that memory is later reused for another object allocation
(3) For measuring the running time of all the implementations, we use the chrono time library of C++
(4) All allocators request memory from the operating system using either malloc or the new operator. In our implementations we use malloc and use portions of this memory to serve an object whenever required
Also, we reuse the memory whenever the previously allocated object no longer requires it.
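Under the interface above, the benchmarking routine can be written once and reused for every manager. The sketch below is written in the spirit of Algorithm 1; the template shape, the placement-new construction step, and the trivial malloc-based baseline manager are our assumptions for illustration.

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <new>

struct MyPracticalClass {               // hypothetical test class
    int value;
    void initialize(int v) { value = v; }
};

// Baseline manager that simply forwards to malloc/free, per point (4);
// the pool-based managers sketched later replace this with preallocated memory.
struct MallocManager {
    void* nxtAddress() { return std::malloc(sizeof(MyPracticalClass)); }
    void  freeAddress(void* p) { std::free(p); }
};

// Benchmark loop in the spirit of Algorithm 1: `cycles` rounds of
// `objects` allocations followed by `objects` deallocations, timed with chrono.
template <typename Manager>
double benchmark(Manager& mm, int cycles, int objects) {
    void** ptrs = new void*[objects];
    auto start = std::chrono::steady_clock::now();
    for (int c = 0; c < cycles; ++c) {
        for (int i = 0; i < objects; ++i) {
            ptrs[i] = mm.nxtAddress();
            (new (ptrs[i]) MyPracticalClass)->initialize(i);  // construct in place
        }
        for (int i = 0; i < objects; ++i)
            mm.freeAddress(ptrs[i]);
    }
    std::chrono::duration<double> sec =
        std::chrono::steady_clock::now() - start;
    delete[] ptrs;
    return sec.count();
}

int main() {
    MallocManager mm;
    std::printf("baseline: %.3f s\n", benchmark(mm, 500000, 1000));
}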

How It Works.
We will first initialize a large chunk of memory. The size of this pool is equal to the size of the class (here MyPracticalClass) multiplied by the number of objects (here 1000) that we will be allocating and deallocating in one cycle (the number of cycles here is 500000). Whenever we call the MemoryManager::nxtAddress() function, it returns a pointer to a free region of sizeof(MyPracticalClass) bytes somewhere in the pool. We can think of the pool as an array of empty objects whose size we have already calculated above (here 1000). The empty objects serve as the memory for the actual objects we want to create and use. Whenever we want to create an object, we call the function MemoryManager::nxtAddress().

This returns the memory for the object. We then initialize the object using the initialize member function of the class. The memory manager now records that this memory has been served to some object.
The next time we want to allocate memory for another object, the memory manager returns the next memory address that is free. Figure 1 shows the allocation of two objects in the memory pool.
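A minimal sketch of such a fixed pool is given below. The class and member names (SSDAMPool, a free-slot stack) are our own assumptions for illustration, since the paper does not list the full SSDAM source.

#include <cstddef>
#include <cstdlib>

// Sketch of an SSDAM-style fixed pool: one contiguous block of
// N * sizeof(T) bytes, with a stack of free slot addresses.
// Names and bookkeeping are our assumptions, not the paper's code.
template <typename T, std::size_t N>
class SSDAMPool {
    unsigned char* pool_;   // the contiguous "array of empty objects"
    void* free_[N];         // stack of currently free slot addresses
    std::size_t top_;       // number of free slots
public:
    SSDAMPool() : top_(N) {
        pool_ = static_cast<unsigned char*>(std::malloc(N * sizeof(T)));
        for (std::size_t i = 0; i < N; ++i)   // initially every slot is free
            free_[i] = pool_ + i * sizeof(T);
    }
    ~SSDAMPool() { std::free(pool_); }

    // Returns a free sizeof(T) slot from the pool (null when exhausted).
    void* nxtAddress() { return top_ ? free_[--top_] : nullptr; }

    // Marks the slot free again so a later allocation can reuse it.
    void freeAddress(void* p) { free_[top_++] = p; }
};

With SSDAMPool<MyPracticalClass, 1000>, an object is created in place with placement new on the returned address, mirroring the initialize step described above, and a cycle of 1000 allocations and deallocations never touches the operating system after the initial malloc.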
In our example, we have 1000 allocations in one cycle, so after all the memory addresses are returned, our pool will look as shown in Figure 2. The next thing we do is 1000 deallocations. This is done by calling MemoryManager::freeAddress() with the memory address of the object as the parameter, i.e., the pointer to the object. Now, 1 cycle out of 500000 cycles of 1000 allocations and deallocations is done. In the next cycle, this memory manager's pool can be reused for another 1000 allocations and deallocations. As one can see, SSDAM follows the principle of reusing memory and using a compact and contiguous data structure: after each cycle, the memory pool becomes free of objects again, as shown in Figure 3.

DLLOM.
The next implementation is based on a doubly linked list. Its constructor can be sketched in pseudocode as follows:

MemoryManager(class T, poolObjCount, count)
{
    typeSize = sizeof(T)
    /* prev points to the complex node before the current complex node */
    /* next points to the complex node after the current complex node */
    TYPE Link { prev, next }
    linkSize = sizeof(Link)
    /* size of a complex node */
    typePlusLinkSize = typeSize + linkSize
    sRef = <Link*> malloc(typePlusLinkSize)
    sRef->prev = sRef->next = null
}

This means that, from our principle of reusing memory and using compact and contiguous data structures, we will not be using a compact and contiguous data structure here. Instead, we will be using linked-list memory. We will have a number of nodes equal to the number of objects that the user wants to create, and these are linked using a doubly linked list. Each node has two pointers referring to the previous and next nodes. Along with these two pointers, there is a data field. This data field is what we will use to store the memory address at which we can initialize our object.

How It Works.
As in the case of SSDAM, the memory manager is set up at the beginning, before executing the main code, but in DLLOM we do not initialize a memory pool or create any nodes up front. Instead, whenever we want to create an object, we call the MemoryManager::nxtAddress() function, which creates a node, links it into the doubly linked chain, and returns the address of the created node's data field. The data field is a block of memory the size of the object we want to create. As the figures show, each node has a blue part as well, not only a green one, because the green portion alone describes the doubly linked list. Hence, a DLLOM node is not purely a doubly linked list node, but rather a complex node. The memory of a complex node can thus be divided into two parts: the first part stores the doubly linked list data, and the second part is reserved for our object. The memory manager maintains the chain of these complex nodes, using the doubly linked list to link the complex nodes and the other part of each complex node as a memory pool for a single class object. Further objects are allocated in a similar fashion, and the complex chain can be represented as shown in Figures 4 and 5.
In the case of deallocation, whenever an object anywhere in the chain is freed, its node is pushed to the end of the chain. This complex node can now be reused, in the sense that its object memory (the orange part) can serve the memory request of another object allocation. When another object is deallocated, the list becomes as shown in Figure 6.
Multiple free nodes can be seen in Figure 7. Now, whenever we request memory from the memory manager, it looks for the first free link at the end of the chain and returns that node's object memory. The list then looks like the one shown in Figure 8.
After the next allocation, no free complex nodes are left (Figure 9).
Since no free complex nodes are left, one more memory request forces the memory manager to create a new complex node and return the memory from that node. The new complex node is added at the end of the chain (Figure 10).

Because DLLOM can allocate as many nodes as it requires, the chain is not fixed at some specific length and hence does not restrict the number of objects that can be allocated in a given cycle. DLLOM is therefore more flexible than SSDAM in serving any number of memory requests. For the deallocation of 1000 objects, the process of freeing a complex node and pushing it to the end of the chain is repeated 1000 times. After that, as shown in Figure 11, all the nodes in the chain are free.
Now, 1 cycle out of 500000 cycles of 1000 allocations and deallocations is done. In the next cycle, this memory manager's pool can be reused to do 1000 allocations and deallocations in the way we have discussed above.
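The following C++ sketch captures this complex-node scheme under our own naming assumptions (the paper's full DLLOM source is not listed): each node carries the two links plus the object space, freed nodes are unlinked and pushed to the tail, and the chain grows with a new malloc only when no free node is left.

#include <cstdlib>

// Sketch of a DLLOM-style manager; names are our assumptions.
// A "complex node" is a Link header (the green part) followed by
// sizeof(T) bytes of object space. Assumes alignof(T) is no stricter
// than the alignment of the two-pointer header.
template <typename T>
class DLLOM {
    struct Link { Link* prev; Link* next; };

    Link  sent_;                 // sentinel of a circular chain
    Link* firstFree_ = nullptr;  // first node of the free region at the tail

    static void* objectOf(Link* n) { return n + 1; }
    static Link* nodeOf(void* p)   { return static_cast<Link*>(p) - 1; }

    void pushBack(Link* n) {     // link n in just before the sentinel
        n->prev = sent_.prev; n->next = &sent_;
        sent_.prev->next = n; sent_.prev = n;
    }

public:
    DLLOM() { sent_.prev = sent_.next = &sent_; }
    ~DLLOM() {
        Link* n = sent_.next;
        while (n != &sent_) { Link* nx = n->next; std::free(n); n = nx; }
    }

    // Reuse the first free complex node if any; otherwise grow the chain.
    void* nxtAddress() {
        Link* n = firstFree_;
        if (n) {
            firstFree_ = (n->next == &sent_) ? nullptr : n->next;
        } else {
            n = static_cast<Link*>(std::malloc(sizeof(Link) + sizeof(T)));
            pushBack(n);         // new node joins the end of the chain
        }
        return objectOf(n);
    }

    // Unlink the node and push it to the end of the chain for reuse.
    void freeAddress(void* obj) {
        Link* n = nodeOf(obj);
        n->prev->next = n->next; n->next->prev = n->prev;
        pushBack(n);
        if (!firstFree_) firstFree_ = n;   // the tail free region was empty
    }
};

Note that, as in the description above, allocation never searches the whole chain: the manager keeps a pointer to the free region at the tail and grows the chain only when that region is empty.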

Benchmark.
The DLLOM approach gave an average running time of 7.496 sec. So, it is slower than SSDAM but still faster than the general new/delete approach by a factor of around 4.

SSDAM-E.
SSDAM requests one large contiguous pool of memory from the operating system (in our case, 1000 * sizeof(MyPracticalClass)). If we wanted many more than 1000 objects in one cycle, say 1000000, the operating system will either return the memory or not. If it does not return the memory, a runtime error will be thrown; if it does, the memory space that we think is contiguous might not be physically contiguous, which can degrade our program's performance in terms of more CPU request cycles, more indirections, and probable cache misses. To deal with that, SSDAM-E, instead of requesting a single large contiguous pool of memory, requests multiple small contiguous pools of memory and connects them using a singly linked list.

How It Works.
Small contiguous pools of memory connected using a singly linked list are managed by the memory manager and used to serve memory requests for our object allocations. Similar to the concept of a complex node in the DLLOM approach, we also have complex nodes in SSDAM-E. Shown in Figure 12 is a single complex node of the SSDAM-E singly linked list. The memory of a complex node can be divided into two parts: the first part stores the singly linked list data, and the second part is a memory pool serving several objects. The first part is shown in green, and the second part can be thought of as an SSDAM memory pool. The pool is divided into empty objects; when they are empty they are orange, and when they are occupied/referred to by some object they are blue, similar to SSDAM. In SSDAM-E, instead of allocating 1000 objects in a single memory pool during one cycle, we allocate 100 objects in each individual pool out of the total number of objects (here 1000) we need to allocate in one cycle. So, we will have 1000/100 = 10 pools, and the size of each pool is 100 * sizeof(MyPracticalClass). These pools are connected internally by the memory manager into a singly linked chain of complex nodes. There are now 10 complex nodes; for the sake of simplicity, we show two of them in Figure 13.
After the first 100 object allocations done by the SSDAM-E memory manager, the first complex node is exhausted and looks as shown in Figure 14.
The free orange memory areas are now occupied by 100 objects, and thus the first pool turns blue. The next complex node in the chain is then used linearly from left to right for the next 100 object allocations, filling all 100 memory locations that its pool can provide. Figure 15 shows the singly linked list with no free complex nodes.
The next 800 objects consume the next 8 complex nodes. So, after 1000 allocations (done in one cycle out of 500000), the SSDAM-E chain is full. The next phase is the deallocation of 1000 objects in the same cycle. This is done by calling MemoryManager::freeAddress() with the memory address of the object as the parameter, i.e., the pointer to the object. After 1000 deallocations, all 10 complex nodes have the object memories in their pools freed, as shown in Figure 16. Now, 1 cycle (out of 500000 cycles) of 1000 allocations and deallocations is done. In the next cycle, the pool memory of the memory manager's 10 complex nodes can be reused for the next 1000 allocations and deallocations, following the process described above. In SSDAM-E, we can have two extreme cases. In the best case, the number of memory pools is 1000/1000 = 1, i.e., we have only one pool of size 1000. In the worst case, the number of memory pools is 1000/1 = 1000, i.e., we have 1000 memory pools of size 1.
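A compact sketch of this scheme is given below, under our own naming assumptions. For simplicity it grows the chain on demand, whereas the configuration described above fixes the pool count up front (10 pools of 100 objects).

#include <cstddef>
#include <cstdlib>

// Sketch of an SSDAM-E-style manager: small contiguous pools
// ("complex nodes") chained with a singly linked list. The names
// and the shared free-slot stack are our assumptions.
template <typename T, std::size_t PoolSize>
class SSDAME {
    struct PoolNode {
        PoolNode* next;                            // singly linked green part
        alignas(T) unsigned char slots[PoolSize * sizeof(T)];
    };
    PoolNode*   head_ = nullptr;                   // chain of small pools
    void**      free_ = nullptr;                   // stack of free slot addresses
    std::size_t top_  = 0;                         // number of free slots
    std::size_t cap_  = 0;                         // total slots across all pools

    void addPool() {                               // grow the chain by one pool
        PoolNode* p = static_cast<PoolNode*>(std::malloc(sizeof(PoolNode)));
        p->next = head_; head_ = p;
        cap_ += PoolSize;
        free_ = static_cast<void**>(std::realloc(free_, cap_ * sizeof(void*)));
        for (std::size_t i = 0; i < PoolSize; ++i)
            free_[top_++] = p->slots + i * sizeof(T);
    }

public:
    ~SSDAME() {
        while (head_) { PoolNode* n = head_->next; std::free(head_); head_ = n; }
        std::free(free_);
    }
    void* nxtAddress() {
        if (top_ == 0) addPool();                  // all pools full: add one
        return free_[--top_];
    }
    void freeAddress(void* p) { free_[top_++] = p; }
};

In these terms, the best case above corresponds to PoolSize equal to the per-cycle object count (a single pool), while the worst case corresponds to PoolSize = 1, which degenerates to one pool per object and maximizes the link overhead.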

Benchmarks.
The best case of SSDAM-E gave an average running time of less than 3.800 seconds, which is 2.7-5% faster than SSDAM. The worst case of SSDAM-E gave an average running time of 5.650 sec, which is still faster than the DLLOM average of 7.496 sec.

Results and Discussion
The following situations were considered and compiled to get the results. Figures 19 and 20 depict the running time against the number of objects allocated/deallocated per cycle and against the number of cycles, respectively. We have discussed the implementations and the trade-offs between them. We first performed a single benchmark test, keeping fixed the number of allocation-deallocation cycles and the number of objects allocated in one cycle. In the case of SSDAM, we also added variations (in one case, we used the placement new operator rather than the class initialize function, and in the other, we overloaded the new and delete operators within our class, i.e., MyPracticalClass). The graphs show how the approaches perform against each other as the computation at hand varies.

Conclusion
This article discusses various concepts and implementations of memory management techniques that can be used at the source code level and are designed to be pragmatic in their application. All of the approaches discussed so far have a low runtime overhead and are thus applicable to a wide variety of use cases. In the future, we intend to investigate the integration aspects of the approaches discussed here and to attempt to apply them to existing production systems known to have memory performance issues. The inherent drawbacks of the standard memory management operators are highlighted: their application is intended to be extremely generic, much like the concept of dynamic memory itself, so they cannot exploit the optimization techniques and opportunities that particular use cases present. Each source code base, by contrast, can be served by a memory manager modeled on its own memory usage pattern, which speeds up memory management procedures. The SSDAM, SSDAM-E, and DLLOM strategies have been evaluated and compared against the performance of the new and delete operators. SSDAM-E, SSDAM with overloaded new/delete operators, and DLLOM speed up allocation and deallocation by factors of 8.01, 7.0, and 4.0, respectively. Even in the worst-case scenario, SSDAM-E's average execution time of 5.650 seconds is faster than DLLOM's. As far as energy efficiency is concerned, SSDAM-Original and SSDAM-E-Original achieve 100 percent, whereas the new/delete operators have a baseline efficiency of 12.48 percent.

Data Availability
The data and code are available from the authors.