Efficient Processing of Image Processing Applications on CPU / GPU

Department of Computer Science, University of Peshawar, Peshawar, Pakistan
Department of Computer Science, Institute of Management Sciences, Peshawar, Pakistan
Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
Institute of Computing, Kohat University of Science and Technology, Kohat, Pakistan
Department of Computer Engineering, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia


Introduction
GPUs (graphics processing units) are becoming popular for exploiting data-level parallelism [1] in embarrassingly parallel applications [2] because of their SIMD (single instruction multiple data) [3,4] architecture. Task-level parallelism [5][6][7], however, is better exploited by general-purpose processors because of their MIMD (multiple instruction multiple data) [8] architecture. More recently, general-purpose processors have been combined with GPUs to provide general-purpose computation accelerated by GPUs (commonly referred to as GPGPU) [9]. A heterogeneous cluster has many CPUs and GPUs and can exploit both task-level and data-level parallelism in applications. The amount of data generated in the form of images and videos is enormous because surveillance cameras, smartphones, and many other devices that capture images and videos are installed or used everywhere. These devices are constantly recording scenes, and the recordings can be processed for different types of computer vision, image processing, machine learning, and data science tasks. Images inherently have data-level parallelism [1,10] (i.e., individual images can be processed in an embarrassingly parallel fashion) because all pixels can be processed independently; therefore, GPUs are commonly used for image processing applications. A heterogeneous cluster (i.e., a cluster containing CPUs and GPUs) can exploit both task-level parallelism (across multiple images) and data-level parallelism (within images). GPUs enable heterogeneous clusters to accelerate the processing-intensive operations on images in big data and machine learning applications.
Data processing applications are commonly executed in a cloud environment using the MapReduce [11] parallel processing model. The MapReduce scheduler influences application performance when it is used in a heterogeneous cluster. The dynamic nature of the cluster and the computing workload affect the execution time of the application. Data locality is essential for reducing the total application execution time and hence improving the overall throughput.
With current technology, it is very challenging to provide a mechanism that efficiently utilizes the computing resources in a heterogeneous cluster [12]. Traditionally, applications that use both CPUs and GPUs are written in low-level programming languages, and limited support is available to programmers for efficiently exploiting parallelism in these clusters. A heterogeneous cluster normally has nodes of different computational capabilities [13]. For instance, one node may have 4 CPUs and 1 GPU, and another node may have 16 CPUs and 8 GPUs. Therefore, static distribution of workload amongst these nodes without taking their processing capabilities into account is not justified [14], as one node will be overloaded while the other remains idle most of the time [15]. Load balancing can be done dynamically, but it incurs extra overhead at runtime [16].
In this paper, we present a new technique for the efficient distribution of images in a heterogeneous cluster. The goal is to maximize throughput and the utilization of the processing resources [17][18][19] (i.e., both CPUs and GPUs). We provide a programming framework that ensures efficient workload distribution amongst the nodes by dividing the data into equal-size splits and then distributing the split data between CPU and GPU cores based on their computational capabilities [20]. The aim is to achieve maximum resource utilization, gain high performance by dividing the data into equal-size splits, allocate the data locally to the computing units, and minimize data migration to GPUs [21]. The rest of the paper is organized as follows. Related studies are given in Section 2. In Section 3, we provide background on existing work that uses CPU and GPU integration to efficiently solve different problems. We also provide a brief overview of the Hadoop programming framework, as this framework is used to test the ideas presented in this paper. We highlight limitations in existing state-of-the-art frameworks and explain the problems that prevent efficient utilization of computing resources. The details of the proposed framework are given in Section 4, and the experiments are presented in Section 5. We conclude the paper in Section 6.

Related Work
In [22], image processing for face detection and tracking is performed using CPU and GPU integration in the Hadoop framework, improving performance by 25%. In [23], a GPU-based Hadoop framework is used to evaluate the Canny edge detection algorithm with Hadoop's default scheduler for workload distribution, demonstrating a twofold performance improvement over HIPI-based image processing [24]. In [25,26], face detection in video frames is performed using a CUDA-based Hadoop framework. It is observed that actual data processing takes 55% of the processor time, and the remaining 45% is spent idle or on other management activities.
Another framework named SEIP (system for efficient image processing) [27] achieved high performance for image processing applications by applying an in-node pipeline framework. However, it is observed that the number of loads/stores to the GPU equals the number of images, which results in overhead and performance loss when processing a large number of small images. In addition, there is no policy for workload distribution between CPU and GPU. In [28], an integrated framework based on Hadoop and GPU has been developed for processing massive amounts of satellite images. To achieve application performance, image data are split into parts and each part is allocated to processing units, but there is no support for efficient workload distribution between CPU and GPU.
An energy-efficient runtime mapping and thread partitioning approach has been developed for distributing concurrent OpenCL applications between CPU and GPU cores and has demonstrated a 32% increase in system performance [29]. The feature extraction algorithm SIFT [30] has been implemented in OpenCL so that the workload is distributed across CPU and GPU. It is demonstrated that features were extracted at more than 30 FPS (frames per second) on full HD images, and an average speedup of 2.69 was achieved. Another OpenCL-based framework, a single-node vertically scaled system, is developed in [31], where multiple GPUs are combined and treated as a single computing device. It automatically distributes an OpenCL kernel written for a single GPU into multiple CUDA kernels at runtime, which are executed on multiple GPUs. The experiment was performed by combining 8 GPUs in a single-node environment, and a speedup of 7.1x was achieved compared to a single GPU. An average overhead of 0.48% is reported.
In [32], an algorithm is proposed that improves the efficiency of Hadoop clusters. The experiments demonstrated that if a process is defined that can handle different use-case scenarios, the overall cost of computing can be reduced and the distributed system can be used for faster execution. A reinforcement learning-based MapReduce scheduler is proposed in [33] for heterogeneous environments. The system observes the state of task execution and suggests speculative execution of slow tasks on other free nodes in the cluster. The proposed approach does not need any prior knowledge of the environment and adapts itself to the heterogeneous environment. The experiments demonstrate that, over a few runs, the system can better map tasks to the available resources in a heterogeneous cluster and hence improve the overall performance of the system. Workload partitioning and task granularity selection for a given application based on machine learning techniques are presented in [34]. A predictive model is trained off-line, and the trained model then predicts the data partition and task granularity for any program at runtime. The experiments demonstrate a 1.6× average speedup using a single MIC. Other studies on improving parallel computation are given in [35][36][37][38].
In all techniques presented in this section, the objective is to improve the performance of execution in a heterogeneous environment.
The current state-of-the-art techniques demonstrate that imbalanced distribution of tasks, without considering the underlying computational capabilities, results in inefficient execution of applications. The technique proposed in this paper considers the underlying computational capability of each processor and then assigns tasks accordingly. This results in the performance improvement demonstrated in Section 5. To the best of our knowledge, no previous studies have addressed the problem from the same perspective as this research.

Background
In this section, existing frameworks are explained along with the limitations.
CUDA (compute unified device architecture) [56], developed by Nvidia, is the most commonly used programming language for GPUs. CUDA integrated with Hadoop enhances application throughput by using the distributed computing capability of Hadoop and the parallel processing capability of the GPU [57]. The Mars framework [58], which has been used for processing web documents (searches and logs), was the first framework to combine GPUs with Hadoop. Other popular frameworks that integrate GPUs with Hadoop are MAPCG [59], StreamMR [60], and GPMR [61]. However, these frameworks were developed for specific projects and do not improve the performance of image processing applications in Hadoop. The Hadoop image processing interface (HIPI) can efficiently process a massive amount of small images but has no GPU support [24].
A heterogeneous Hadoop cluster with nodes equipped with GPUs is shown in Figure 1. Hadoop is a well-known, easy-to-use platform with a loosely coupled architecture that provides a distributed environment. Hadoop consists of the Hadoop distributed file system (HDFS) [62] and the MapReduce [46] programming model.
HDFS is an open-source implementation of the Google File System (GFS). It stores data on different data nodes in a cluster and follows a master-slave architecture. Access permissions and data services for the slave nodes are provided by the master node, also known as the name node, while the slaves, known as data nodes, serve as storage for HDFS. Large files are handled efficiently by dividing them into chunks that are distributed amongst multiple data nodes. A map job is placed on each node to process its local copy of the data. The name node only keeps a record of the metadata and log information, while the Hadoop API is used to transfer data to and from HDFS. MapReduce provides an abstraction for data synchronization, load balancing, and dynamic allocation of tasks to different computing units in a reliable manner.

Limitations in Existing Systems.
Some issues that lead to imbalanced workload distribution in a heterogeneous environment are discussed below.

Data Locality.
Data locality means that the mapper and its data are located on the same node. If the data and the mapper are on the same node, then the mapper can efficiently map the data for computation, but if the data are on a different node, then the mapper has to load the data from that node over the network. Suppose there are 50 mappers that try to copy data from other data nodes simultaneously. This situation leads to high network congestion, which is undesirable because it degrades the overall performance of the application. The situation where mapper and data are on the same node is shown in Figure 2(a), and the situation where they are on different nodes is shown in Figure 2(b). For efficient processing of applications, a programming framework should be able to ensure data locality. When the data stored in HDFS are distributed amongst the nodes, data locality needs to be handled very carefully. The data are divided into splits, and each split is provided to a data node in the cluster for processing. The MapReduce job maps each split to an individual mapper that processes it.
That means that moving the computation closer to the data is better than moving the data closer to the computation. Hence, good data locality means good application performance.
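The preference for running a mapper where its split already resides can be illustrated with a small sketch. The function name `place_mappers`, the node names, and the slot counts are hypothetical, and the greedy policy is a simplification for illustration only; it is not Hadoop's actual scheduler.

```python
def place_mappers(split_locations, free_slots):
    """Greedily place one mapper per split, preferring a node that stores the split.

    split_locations: dict mapping split id -> list of nodes holding that split
    free_slots:      dict mapping node -> number of free mapper slots
    Returns a dict mapping split id -> (node, is_local); node is None when
    no slot is free anywhere.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    placement = {}
    for split_id, holders in split_locations.items():
        # First choice: a node that already holds the data (no network transfer).
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fallback: any node with a free slot; the split is then read remotely.
            node = next((n for n, free in slots.items() if free > 0), None)
        if node is not None:
            slots[node] -= 1
        placement[split_id] = (node, is_local)
    return placement


# Hypothetical cluster: node-a has one free slot, node-b has two.
locations = {"split-0": ["node-a"], "split-1": ["node-a"], "split-2": ["node-b"]}
result = place_mappers(locations, {"node-a": 1, "node-b": 2})
remote = [s for s, (node, local) in result.items() if node is not None and not local]
print(result)
print("splits read over the network:", remote)
```

In this toy run, only the second split has to be read over the network, because the single slot on the node that stores it is already taken.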

Split Size.
For efficient execution of programs in a heterogeneous cluster, the programming framework should be able to divide the data evenly into splits and then distribute them amongst the available nodes. One of the main characteristics of MapReduce is that it divides the whole data into chunks (input splits) according to the block size of HDFS. The default Hadoop block size is 64 MB, and the issue of data locality arises when the input split size is larger than the block size. For better performance, the input split size should be equal to or less than the block size of HDFS.
Block size and split size are not the same: the block size is the physical chunk of data stored on disk, whereas the input split size is the logical chunk of data with pointers to start and end locations in a block. When the split size is larger or much smaller than the default block size, the data are distributed unevenly, which leads to issues in data locality and memory wastage.
In Figure 3, the scenario of uneven split size is highlighted. Suppose the block size is 64 MB and each split size is 50 MB. The first split fits easily in block 1, but the second split starts where the first split ends and does not fully fit in block 1, so the second split is stored partially in block 1 and partially in block 2. When a mapper is assigned to block 1, it reads the first split, but it cannot read the complete second split, as that split does not fully fit in block 1, and so it cannot generate a final result for the second split. According to [39,63,64], as quoted from the book, "The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat's logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program - lines are not missed or broken, for example - but it's worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant." If a record or file spans the HDFS block boundaries of two nodes, then one of the nodes performs remote reads to fetch the missing piece. It reads the data and generates the final result, but with the overhead of communication between the two nodes. Hence, communication overhead arises because the data are not evenly distributed according to the block size, and most of the time a mapper waits for other mappers to generate their results and then synchronizes with them for the final result.
This problem can be solved by arranging the whole data into splits of an ideal input size. When the data are divided into equal splits of a suitable size that is less than or equal to the default block size, the mapper of each block reads its data easily and does not wait for another mapper to send the part of a split that is stored in a different block.
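The boundary problem in the 64 MB block / 50 MB split scenario can be checked with a few lines of arithmetic. The sketch below assumes that splits are laid out back to back from offset 0 and that all sizes are whole megabytes; the function name and the total data sizes are illustrative assumptions, not values from the paper.

```python
def splits_crossing_blocks(total_mb, split_mb, block_mb=64):
    """Return the indices of splits that span more than one block.

    Assumes splits are stored contiguously starting at offset 0 and that
    all sizes are whole megabytes.
    """
    crossing = []
    for index, start in enumerate(range(0, total_mb, split_mb)):
        end = min(start + split_mb, total_mb)  # exclusive end offset
        first_block = start // block_mb
        last_block = (end - 1) // block_mb     # block holding the last megabyte
        if first_block != last_block:
            crossing.append(index)
    return crossing


# 200 MB of data in 50 MB splits: every split after the first crosses a boundary.
print(splits_crossing_blocks(200, 50))
# 192 MB of data in block-aligned 64 MB splits: no split crosses a boundary.
print(splits_crossing_blocks(192, 64))
```

Each index reported for the 50 MB case corresponds to a split whose mapper would need a remote read under this contiguous layout.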

Data Migration and Inefficient Resource Utilization.
Data migration is the process of transferring data from one node or processor to another, as shown in Figure 4. Data migration between systems is usually performed programmatically to achieve better performance. In heterogeneous systems, however, where a node contains both a CPU and a GPU, the GPU, being the faster processor, completes its task quickly and then fetches data from the CPU, the slower processor. Because of this data migration, the scheduler is constantly busy managing task scheduling while the GPU sits idle, and application performance suffers.
Above are some of the problems that need to be tackled while using heterogeneous systems, so that application performance can be increased. In this paper, we will integrate CUDA with Hadoop to increase the performance in processing images in heterogeneous clusters. We will integrate the Hadoop platform that is used for distributed processing on clusters with libraries that allow code to be executed on GPUs.

Efficient Workload Distribution Based on Processor Speed
The issues of data locality, input split size, data migration, and inefficient resource utilization discussed in Section 3.2 lead to imbalanced workload distribution in a heterogeneous environment, which results in inefficient execution of applications.
The proposed framework distributes the data in a balanced form amongst the nodes according to their computing capabilities, as shown in Figure 5. The distribution of data in the framework consists of two phases. In Phase I, the data are distributed amongst the nodes, and in Phase II, the workload is divided between CPU and GPU in proportion to their computing capability. Both phases are explained below in detail.

Phase I: Data Distribution amongst Nodes.
In Hadoop, the workload is organized in splits, which are then distributed amongst the nodes in the cluster to be processed. However, this workload is not evenly distributed, and hence processors with lower processing capabilities are overwhelmed. In the proposed framework, the input is in the form of images and is distributed evenly amongst cluster nodes in order to utilize computation and memory resources efficiently. We have developed a novel distribution policy for the even distribution of same-size images into splits, where a split contains one or more images. We focus on same-size images only, in order to avoid the communication overhead incurred when images of different sizes are loaded. With images of different sizes, the ratio calculations explained in Section 4.2 are not applicable. This idea is mainly inspired by arrays, where contiguous blocks of the same size are gathered together. The distribution policy does not allow a single image to be distributed across multiple splits, because of the data locality issue. The ideal split size is set according to the size of the images, so that no image exceeds the boundary of its split. Multiple images evenly grouped together in a split and ready to be distributed amongst nodes are shown in Figure 6, and the ideal split size is calculated relative to the default block size (i.e., 64 MB in Hadoop), as shown in Figure 7.
To avoid the problem of uneven splits, i.e., an image being distributed across multiple splits, the split size is set carefully based on the image size. We first measure the size of one image and then select an input split size that holds a whole number of images of that size. Let $I$ be the ideal input split size to be computed, $d$ be the default split size in Hadoop, $s$ be the size of one input image, $n_o$ be the number of images that can be fully accommodated by the default input split, $T_i$ be the total number of images in the dataset, and $S_n$ be the total number of splits into which the data are divided equally. Then

$$n_o = \left\lfloor \frac{d}{s} \right\rfloor, \qquad I = n_o \times s, \qquad S_n = \frac{T_i}{n_o}. \qquad (1)$$

For example, when each split holds $n_o = 15$ images, the number of splits is $S_n = 90/15 = 6$, where 90 is the total number of input images. So, we will have a total of 6 input splits, each of size 63 MB, to store 90 input images.
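The split-size calculation can be sketched in a few lines. This is one reading of the symbol definitions above (images per split as the floor of the default split size over the image size, ideal split size as that count times the image size, and number of splits as total images over images per split, rounded up); the function name and the 4.2 MB image size are assumptions, the latter chosen only because it reproduces the 63 MB, 15-image, 6-split example.

```python
import math


def ideal_split(image_mb, total_images, default_split_mb=64.0):
    """Return (images_per_split, ideal_split_mb, num_splits) for same-size images.

    images_per_split: how many whole images fit in the default split
    ideal_split_mb:   split size that holds exactly that many images
    num_splits:       splits needed for the whole dataset (rounded up)
    """
    images_per_split = int(default_split_mb // image_mb)
    if images_per_split == 0:
        raise ValueError("a single image is larger than the default split size")
    ideal_split_mb = images_per_split * image_mb
    num_splits = math.ceil(total_images / images_per_split)
    return images_per_split, ideal_split_mb, num_splits


# Assumed image size of 4.2 MB with 90 images and a 64 MB default split.
per_split, split_mb, num_splits = ideal_split(4.2, 90)
print(per_split, round(split_mb, 1), num_splits)
```

Rounding the number of splits up is a design choice of this sketch so that a dataset whose size is not a multiple of the per-split count still gets a final, partially filled split.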

Phase II: Distribution of Workload on CPU and GPU within a Node.
In a heterogeneous cluster, every node is equipped with a GPU, which is much faster than the CPU. Therefore, for efficient execution of applications, it is important that tasks are distributed based on the computing capabilities of the processors. The proposed workload distribution scheme for a heterogeneous Hadoop cluster is shown in Figure 8. A split is a container for a group of images of the same size. For every split, the map function is invoked with an input <key, value> pair, where the key contains the log file of the images in the split and the value contains the contents of the images (bytes). The map function processes the split and reads each image in it. In the proposed algorithm, the map function takes the split and applies the ratio to all the images in that split so that images are assigned to CPU and GPU according to their computing capability. Before images are assigned to the CPU and GPU, a sample image is processed on both to measure the execution time of each processor for the specified algorithm. From these execution times, the processing power of each processor is identified, and a ratio is calculated that determines the number of images assigned to CPU and GPU.
To determine the computing capability of the processors in a node, we initially assign a raw image to both CPU and GPU and measure the execution time on each. Let $c$ be the execution time on the CPU and $g$ be the execution time on the GPU for an image processing algorithm applied to a raw input image. We take the ratio of $\max(g, c)$ to $\min(g, c)$. The device with the larger execution time (i.e., the slower processor) is assigned the value 1, and the device with the smaller execution time (i.e., the faster processor) is assigned the integer value of the execution time of the slower processor divided by the execution time of the faster processor. When the processors fetch images, we assign $nP_x$ images to the slower processor and $nP_y$ images to the faster processor, as shown in the following equation:

$$nP_x = 1, \qquad nP_y = \left\lfloor \frac{\max(g, c)}{\min(g, c)} \right\rfloor.$$

This novel distribution of workload ensures that images are distributed based on the computing capability of each processor, which can speed up the processing of images in heterogeneous nodes. The flow chart of the proposed framework is shown in Figure 9.
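A minimal sketch of the ratio-based assignment, assuming the slower device receives one image per round and the faster device receives the integer quotient of the two sample execution times. The function names, the sample timings (120 ms on the CPU, 40 ms on the GPU), and the round-robin dealing of images are hypothetical choices for illustration, not the paper's implementation.

```python
def cpu_gpu_ratio(cpu_time, gpu_time):
    """Return images per round for each device from sample execution times.

    The slower device gets 1 image per round; the faster device gets the
    integer part of slow_time / fast_time.
    """
    if cpu_time <= 0 or gpu_time <= 0:
        raise ValueError("execution times must be positive")
    slow, fast = max(cpu_time, gpu_time), min(cpu_time, gpu_time)
    fast_share = int(slow // fast)
    if cpu_time >= gpu_time:
        return {"cpu": 1, "gpu": fast_share}
    return {"cpu": fast_share, "gpu": 1}


def assign_images(images, cpu_time, gpu_time):
    """Deal the images of one split to CPU and GPU according to the ratio."""
    ratio = cpu_gpu_ratio(cpu_time, gpu_time)
    # One round of the schedule, e.g. ["cpu", "gpu", "gpu", "gpu"] for 1:3.
    schedule = ["cpu"] * ratio["cpu"] + ["gpu"] * ratio["gpu"]
    assignment = {"cpu": [], "gpu": []}
    for position, image in enumerate(images):
        assignment[schedule[position % len(schedule)]].append(image)
    return assignment


# Hypothetical sample timings: 120 ms on the CPU, 40 ms on the GPU.
ratio = cpu_gpu_ratio(120, 40)
plan = assign_images(list(range(8)), 120, 40)
print(ratio)
print(plan)
```

With these assumed timings the GPU is three times faster, so each round of four images sends one to the CPU and three to the GPU.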

Evaluation
Section 4 introduced the proposed framework for handling imbalanced workload distribution in a heterogeneous Hadoop cluster for image processing applications; here, its implementation is evaluated on commodity computer systems accelerated with GPUs. As heterogeneous systems increase application performance by processing massive amounts of data in parallel, they are expected to become common and to execute programs efficiently by adopting policies for efficient workload distribution. In this section, the evaluation of the proposed programming framework is discussed. The details of the dataset used for the evaluation are given in Table 1. The test environment includes a master node with a Core i5 processor at 2.5 GHz, 8 GB RAM, and 4 processing cores. Two worker nodes are used, each with a Core i3 CPU at 1.8 GHz with 4 cores and 4 GB RAM, and an NVIDIA 802 M GPU with 64 cores and 2 GB RAM.

Processing Images of Different Sizes.
Figure 10(a) shows the average execution time, in milliseconds, of the edge detection algorithm on four images of different sizes on a CPU and a GPU.
This experiment demonstrates that, on the CPU, the execution time of the application increases as the size of the image increases. On the GPU, however, the increase in execution time when processing a very large image is not significant. This analysis demonstrates that the GPU is a better choice for processing larger images. In Figure 10(b), the performance of the application when processing different numbers of images on a GPU is shown. The y-axis shows the execution time in milliseconds, and the x-axis shows the number of images of different sizes. Images of four different resolutions are processed in groups of 1, 2, 3, and 4 images. For instance, when the application processes a single image of size 1024 × 768, the execution time on the GPU is 42 milliseconds, whereas processing four images of the same size takes 51 milliseconds. Similarly, a single image of size 2560 × 1440 takes 69 ms, but processing four images of the same size takes 87 ms. This increase in execution time is mainly due to communication overhead, as each image is migrated from CPU memory to GPU memory and the result is then written back to CPU memory.
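The timings quoted above can be read as a fixed cost plus a small increment per additional image. The short calculation below uses only the four reported numbers; the helper names are hypothetical, and the linear reading of two data points per resolution is an assumption made for illustration rather than an analysis from the paper.

```python
def amortized_ms(batch_ms, batch_size):
    """Average time per image when the whole batch takes batch_ms."""
    return batch_ms / batch_size


def marginal_ms(single_ms, batch_ms, batch_size):
    """Extra time per additional image beyond the first one."""
    return (batch_ms - single_ms) / (batch_size - 1)


# Reported GPU timings: (one image, four images) in milliseconds.
timings = {"1024x768": (42, 51), "2560x1440": (69, 87)}
summary = {
    resolution: (amortized_ms(four_ms, 4), marginal_ms(single_ms, four_ms, 4))
    for resolution, (single_ms, four_ms) in timings.items()
}
for resolution, (average, extra) in summary.items():
    print(f"{resolution}: {average} ms per image in a batch of 4, "
          f"{extra} ms per additional image")
```

Under this reading, each additional image adds only a few milliseconds, which is consistent with the per-image transfer between CPU and GPU memory being the main incremental cost.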

Performance Comparison.
We compare the performance of our approach, i.e., modified split sizes and optimized workload distribution based on the computation capabilities of the processors in a node of a heterogeneous system, with existing state-of-the-art techniques such as HIPI, HIPI executed on GPU, and Hadoop executed on GPU, as shown in Figure 10. The proposed framework processes images faster than the other state-of-the-art frameworks. For instance, processing 90 images of resolution 2560 × 1440, each of size 11.1 MB, takes 686 ms in HIPI but only 191 ms in the proposed framework, i.e., a speedup of about 3.5 is achieved.

Mathematical Problems in Engineering
This speedup is possible mainly because of the efficient workload distribution and the efficient arrangement of images in input splits, which result in efficient utilization of resources.

Effect of Assigning Different Loads to CPU and GPU.
In the proposed approach, we have developed a novel technique to compute the ratio of images to be assigned to CPU and GPU in a heterogeneous node. The performance of this calculation is shown in Figure 11(a), along with a comparison with other state-of-the-art techniques. It is observed that an application processing a dataset of images of different sizes executes efficiently on the proposed framework. For instance, the speedup of the proposed approach is 2.12x compared to HIPI. This indicates that efficient load balancing in heterogeneous systems can roughly double performance.

Execution Time of Each Split.
Input data are divided into input splits and distributed amongst nodes, and the images in each split are accessed and processed. The average execution time taken by images of sizes 1024 × 768, 1600 × 900, 1920 × 1080, and 2560 × 1440 in an input split is shown in Figure 11(b) for all four platforms running the edge detection application. It is demonstrated that the proposed framework processes the images more than two and a half times faster than HIPI. In Figure 11(c), the average total execution time for the whole dataset shown in Table 1 is given for the image processing application performing edge detection. The total execution time includes partitioning the image data into splits, distributing the splits amongst the nodes, further distributing the images between CPU and GPU on each node, and writing the result back to HDFS after processing. The figure shows that, with the proposed approach (Hadoop + GPU), where data are divided into ideal split sizes and then distributed to the processors according to their computing capabilities, the total execution time in milliseconds is about half that of the other existing platforms.

Minimization of the Communication Overhead by Ideal Split Size Selection.
The communication overhead of images during processing is shown in Figure 11(d). The access time of an image in this figure is the time taken to move it from the split to the assigned processor. The execution time taken by a single image is recorded. From the experiment, it is observed that, when the data are arranged local to the computing processor, the images can be easily accessed and processed. Because HIPI has an overhead of data compression and decompression, the access times of HIPI and HIPI + GPU are similar and relatively high. With Hadoop + GPU, the image access time is lower than with HIPI and HIPI + GPU, and by overcoming the data locality issue in the proposed framework (Hadoop + GPU), the image access time is reduced by almost half.

Conclusion and Future Work
Distributed systems provide parallel frameworks in which massive amounts of data can be processed efficiently. Hadoop is one such framework: it provides large data storage and effective computational capability to handle massive amounts of data in a parallel and distributed manner using clusters of commodity computer systems. These clusters also contain GPUs, as they are specialized to handle SIMD workloads efficiently. For image processing applications, or applications that process big data in domains such as healthcare, heterogeneous systems are commonly used and have shown improvement over single-processor and distributed systems. However, imbalanced workload distribution between processors causes data locality issues, and inefficient workload distribution between slow and fast processors can degrade application performance. To deal with this imbalanced workload distribution in a heterogeneous cluster, this paper has proposed a novel technique of dividing data into ideal input splits so that each image is contained in one split and does not exceed the boundary of that split. This paper also introduces a technique for distributing data according to the computing power of the processors in a node.
This distribution maximizes the utilization of available resources and tackles the issue of data migration between fast and slow processors.
The results demonstrate that the proposed framework achieves an almost twofold improvement over current state-of-the-art programming frameworks. The proposed framework provides an efficient mechanism to compute an ideal split size for fixed-size images but does not support partitioning variable-size images into splits. In the future, we will investigate techniques that can compute an ideal split size for variable-size images, which will enable us to process images from different sources. The proposed model can be extended to big data applications that have inherent data-level parallelism in the form of arrays, matrices, images, or tables.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.