Heterogeneous Hadoop Cluster-Based Image Processing Workload Distribution Framework between CPU and GPU



Introduction
A vast quantity of digital data is created in today's technology world by many sources, such as the Internet, networked cameras, mobile phones, and sensors. Digital data used to be measured in megabytes and gigabytes, but today it is measured in terabytes and petabytes. Because 70% of digital data is unstructured, such a large volume of data needs additional storage and processing power [1]. Unstructured data includes images, which are two-dimensional arrays of pixels with varying intensity values. Images also exhibit intrinsic data-level parallelism, which can be exploited when extracting relevant information, a process termed image processing. Image processing is useful in a variety of applications, including medical imaging, satellite imaging, and document analysis.
The constraints of limited memory capacity and data access speed arise when storing such a vast amount of data on a single processing device, and they degrade application performance. Because of these constraints and the programs' high computing demands, the performance of a single CPU or processing system could not be improved enough. Distributed and parallel architectures were therefore developed to process independent pieces of data efficiently across several computing units in parallel.
Hadoop is a well-known and simple-to-use technology with a loosely coupled design and a distributed environment. Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming model. HDFS is a free, open-source file system modeled on the Google File System (GFS). In essence, Hadoop uses the MapReduce method to distribute enormous amounts of data across commodity computers, and it is used by a variety of companies, including Yahoo, Amazon, Facebook, and Google. The MapReduce technique provides dependable data synchronization, load balancing, and dynamic allocation of jobs among multiple compute units.
The GPU architecture is organized so that the hardware has several multiprocessors. Each multiprocessor is composed of a collection of 32-bit processors with a SIMD (Single Instruction Multiple Data) architecture. Every clock cycle, a multiprocessor executes the same instruction on a group of threads known as a warp. The warp size is the number of threads in the warp. Each streaming multiprocessor (SM) has 8 scalar thread processors (SPs), and the threads of a block share 16 KB of on-chip memory for communication. Programmers write two different types of code for GPU execution: kernel code and host code. The kernel code is executed concurrently on the GPU. The host code runs on the CPU and manages data transfer between main memory and the GPU, as well as launching kernels on the GPU.
Heterogeneous clusters combine massively parallel processing cores with GPUs, which provides high-speed, scalable data delivery to remote consumers. Such clusters are adaptable because they build on the network and GPU capabilities of commodity PCs. Heterogeneous computing is the utilization of heterogeneous architectures by applications. Figure 1 shows how heterogeneous architectures are made up of various processor types, such as GPUs and multicore CPUs, each of which has a distinct set of advantages and disadvantages. These platforms can use a variety of hardware that differs in power consumption and performance [2]. As a result, heterogeneous systems can improve performance while lowering energy usage [3].

Problem Statement.
State-of-the-art heterogeneous frameworks for image processing applications provide high performance by efficiently processing vast amounts of image data. To achieve good application performance, however, these techniques need support for distributing data between nodes, and between the CPU and GPU on each node, according to processing power. This can be achieved with a load balancing technique that divides and distributes data between the nodes as well as between the CPU and GPU on each node, depending on their computational capabilities. An effective workload allocation policy must therefore be implemented to improve application performance.

Aims and Objectives.
The following aims and objectives address the problem described above:
(i) To give programmers an easy-to-use image processing framework that automatically distributes workload over a heterogeneous cluster, resulting in improved performance
(ii) To automatically partition image data between the CPU and GPU within each node

Contributions
(i) A new method for partitioning data into optimal split sizes, which ensures locality for computations by ensuring that images within a given split cannot exceed the boundaries of the split
(ii) Splits are distributed across nodes and within nodes according to their computational capabilities, to maximize resource efficiency and minimize data transfer
(iii) A cluster of commodity computer systems can readily manage huge amounts of image data, so expensive supercomputers or specialist vector machines need not be acquired
The rest of the paper is organized as follows: Section 2 reviews the literature, Section 3 presents the proposed framework, Section 4 reports the results and analysis, and Section 5 contains the conclusion and future work.

Literature Review
This work aims to provide image processing in a heterogeneous environment by combining multicore CPU and GPU methods. In a heterogeneous cluster, processors with diverse computing capabilities are paired to offer a programming framework that allows for an optimal split size, balanced job assignment, and maximum resource efficiency.

Image Processing in a Distributed Environment.
In image processing applications, distributed systems have quickly become the preferred platform because of their processing speed, scalability, and efficiency. Distributed systems make it feasible to handle large quantities of uploaded image data. With the growth of distributed systems, applications with large storage and processing needs are becoming more and more popular. Data sharing, device sharing, device connectivity, and task distribution flexibility are all advantages of distributed systems over single-processor systems. Many open-source programming paradigms, such as Spark, have been created to assist in efficient data processing in distributed environments [7]. Tools such as MapReduce, Storm, and UC Berkeley's Spark [4] support computations such as data mining and machine learning. Storm is a distributed real-time computing system that handles unbounded data streams. The MapReduce architecture is a popular choice for large-scale data analysis because of its capacity to handle semistructured and unstructured data in parallel [2].
The Apache Hadoop framework has been used in image processing applications [5], face and gesture recognition [6], face tracking [7], video detection of textual words from online lecture videos [9], and video surveillance [10], among many other applications. The Hadoop MapReduce architecture was used to create a satellite image application for a Qatari environmental study center [11]. Hadoop has also been used to manage enormous amounts of image data in content-based image retrieval (CBIR) [12].

GPU-Based Image Processing.
Using the OpenGL graphics library, the GPU was employed for feature extraction and tracking [13]. GPUs have been employed in applications such as Canny edge detection [14], satellite image processing [15], and medical image processing [16].
A study of face detection using the Viola-Jones method showed that the GPU is faster than the CPU at computing the integral image [17]. When image data is generated in real time from satellites and must be processed quickly, GPUs have been employed for image smoothing [18] and cloud removal [19]. In the medical field, it has been demonstrated that GPUs can be used to detect brain tumor cells, complete several stages of operations, and deliver high performance when processing large amounts of image data rapidly [20].

Image Processing Using Heterogeneous Hadoop Clusters.
Using parallel processing cores and GPUs, heterogeneous clusters help deliver data at high speed and scale to distant consumers [21]. To analyze large amounts of data effectively, CUDA was employed on the Hadoop framework [22], which improves application performance by combining Hadoop's distributed computing capabilities with the GPU's highly parallel processing structure [23,24]. Mars is a framework that combines GPU capabilities with the Hadoop framework, and it is designed to process web documents (searches and logs) [25]. Three further frameworks combine GPU and Hadoop for high performance, though they were designed for specialized scientific tasks rather than image processing: MAPCG [26], StreamMR [27], and GPMR [28].
The Hadoop Image Processing Interface (HIPI) was created to handle large numbers of small images effectively; however, it does not support GPUs [29].

Proposed Framework
The previous section covered some of the challenges that result in uneven work distribution in a heterogeneous context. Addressing these issues requires a programming framework that can efficiently distribute data across nodes, and then within each node between the CPU and GPU, based on their processing capabilities. Figure 2 shows how the proposed programming framework distributes a large amount of image data efficiently in a heterogeneous environment. The process involves two phases: data is first distributed to individual nodes and then distributed within each node between the CPU and GPU. This work focuses on efficient workload distribution in a heterogeneous cluster.

Proposed Distribution of Workload among Nodes.
Workloads can be divided into Hadoop workloads and cluster node workloads. The nodes process the data once it has been received. In this study, images must be evenly distributed across cluster nodes to make the best use of the cluster's processing and memory capabilities. A new distribution policy is proposed for the distribution of images. For the proposed technique, images of the same size were chosen, although images generally come in a variety of sizes, and each split includes one or more images. Because one image cannot be divided into several parts, a split boundary that falls inside an image would degrade performance.
Images in the proposed distribution scheme are grouped so that every split contains as many whole images as fit within the split size. To prevent an image from exceeding the split boundaries, the split size is determined by the image size. Figure 3 depicts several images that have been divided into splits and are ready to be distributed among nodes, while Figure 4 depicts the ideal split size based on the default block size.
To address the issue of uneven splits, which arises when images are distributed unevenly across splits, the split size must be calculated from the sizes of the images. When choosing an input split size in HDFS, the input split size must be set according to the image size so that a whole number of images of the same size can be accommodated.

Selection of Input Split Size.
This method computes the splits based on the size of the input image, and no image can cross two input splits.
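The grouping rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Hadoop implementation: the function name `pack_images_into_splits`, the example sizes, and the greedy in-order strategy are assumptions made for the example.

```python
def pack_images_into_splits(image_sizes, split_capacity):
    """Group consecutive images into splits so that no image crosses a split boundary.

    image_sizes: list of image sizes (any unit, e.g. MB).
    split_capacity: maximum total size of one split (same unit).
    Returns a list of splits, each a list of image indices.
    """
    splits = []
    current, used = [], 0
    for index, size in enumerate(image_sizes):
        if size > split_capacity:
            # Hypothetical policy: an image larger than a split is rejected.
            raise ValueError(f"image {index} does not fit in one split")
        if used + size > split_capacity:
            # Close the current split instead of cutting the image in two.
            splits.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        splits.append(current)
    return splits


# Seven images of 5 units each with a split capacity of 12 units:
# every full split holds two whole images.
example = pack_images_into_splits([5] * 7, 12)
```

With equally sized images, as assumed in the text, every full split ends up with the same number of images, which is what keeps the splits even.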

Workload Distribution between CPUs and GPUs within a Node.
A heterogeneous cluster has GPU coprocessors, which are faster than CPUs. As a result, job assignments must be based on the processors' computational capabilities for best performance. A new, effective workload allocation technique for a heterogeneous Hadoop cluster is proposed to achieve high performance for image processing applications.
This load balancing method is illustrated in Figure 5. This phase involves splitting up each map task into fixed-size images. The map function is called for every split and takes a (key, value) pair as input. Using the map function, each image in the split is read. In the proposed approach, the map function checks the ratio for all images within the split and assigns images to the CPU and GPU based on their computational capabilities. A sample image is executed simultaneously on the CPU and GPU to determine the execution time of an algorithm on each processor. An image ratio is created by comparing the execution times of each processor and comparing how many

Efficient Workload Distribution Implementation.
This section describes the implementation of the proposed approach. Splits are formed by defining three variables in the setSplitSize() method of the class CPU-GPU given in Table 1. Using the HDFS block size as a reference for the split size calculation, the three variables are (i) file size, (ii) default split size, and (iii) optimal split size. The conf setting in run() replaces the default split size with the computed optimal split size.
The CalculateRatio() function, which calculates the ratio, takes three inputs: the sample image, the image width, and the image height. The Concat() method is used to combine the images that will be sent to the GPU, based on the ratio; the details of the image class are given in Table 2. The GPUDetect() method runs an edge detection operation on the GPU based on the width, height, and total number of pixels in an image. Once the images have been processed, the bundle is split again into individual images, each of which is saved in an output file. The map() function of the ImageMapper class takes two parameters that form the (key, value) pair, specifying details about the packaged images and their actual data. The CPU-GPU and ImageMapper classes are shown in Table 1 and Table 2, respectively, while the variations in execution time (milliseconds) for CPU and GPU for edge detection are given in Table 3.

Table 1: The CPU-GPU class.
Class: public class CPU-GPU
Methods:
public static setSplitSize(): determines the size of each split.
public static GPUDetect(float pixels, int width, int height): handles the processing on the GPU.
public void CalculateRatio(BufferedImage bi, int width, int height): calculates the ratio between the GPU and CPU.
public static void Concat(BufferedImage bi, int c): concatenates the images into a single bundle.
public static BufferedImage[] Split(int c, int width, int height): divides data into equal splits.
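As a rough illustration of how a ratio like the one computed by CalculateRatio() could drive the assignment of images, the Python sketch below derives a ratio from two timings and divides a split accordingly. The function names, the timing values, and the rule of giving the CPU one image for every `ratio` images given to the GPU are assumptions for this example; the source does not spell out the exact rule.

```python
def calculate_ratio(cpu_time_ms, gpu_time_ms):
    """Return how many images the GPU can process while the CPU processes one.

    Both timings are assumed to come from running the same sample image
    on each processor. The result is floored and never less than 1.
    """
    if cpu_time_ms <= 0 or gpu_time_ms <= 0:
        raise ValueError("timings must be positive")
    return max(1, int(cpu_time_ms // gpu_time_ms))


def partition_images(images, ratio):
    """Divide the images of one split between CPU and GPU.

    Assumed rule: out of every (ratio + 1) images, one goes to the CPU
    and the rest go to the GPU; any remainder also goes to the GPU.
    """
    cpu_count = len(images) // (ratio + 1)
    return images[:cpu_count], images[cpu_count:]


# Hypothetical timings: the sample image takes 120 ms on the CPU
# and 40 ms on the GPU, so the GPU is given three images per CPU image.
ratio = calculate_ratio(120.0, 40.0)
cpu_images, gpu_images = partition_images(list(range(8)), ratio)
```

In this sketch the images assigned to the GPU would then be concatenated into one bundle, in the spirit of Concat(), so that they are transferred and processed together.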

Results and Analysis
In this section, all the results of the conducted experiments are shown and analyzed in detail.

Comparative Analysis of CPU and GPU.
The results in Table 4 and the bar chart in Figure 7 show the execution times, in milliseconds, for images of four different resolutions on a CPU and a GPU. In the experiment, both the CPU and GPU execution times increased as the image size rose; however, the GPU execution times increased more slowly than the exponentially rising CPU execution times. The variations in execution times for the CPU and GPU are shown in Figures 8 and 9, respectively. Table 5 demonstrates that, as image size increases, the variance in GPU execution time is smaller than that of the CPU [27]. By demonstrating the effect of increasing image size on computation time, this experiment illustrates how the GPU can be used to improve application performance even when the overhead of loading data from CPU memory to the GPU is taken into account.

Varying/Increasing Number of Images on GPU.
The experiment in Figure 10 demonstrates how the integration of images (combining several images so that they are processed together) affects the performance of the program on a GPU. The x-axis represents the number of images integrated with each other, while the y-axis shows the execution time in milliseconds. Tables 6 and 7 present the execution time and standard deviation for this experiment, obtained by integrating 1, 2, 3, and 4 images at different resolutions. In the experiments, a single image with a resolution of 1024 × 768 is processed on its own in 42 milliseconds, whereas four images of the same resolution are processed together in 51 milliseconds.
According to the experimental results, the variation in execution time is directly attributable to the number of exchanges between the CPU and GPU for the transfer of each image and the subsequent writing of the result back to the CPU. These data transfer and loading steps are executed for each individual image, whereas with image integration all the integrated images are handled in one cycle.
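The two timings quoted above (42 ms for one 1024 × 768 image and 51 ms for four integrated images) can be turned into a small worked example. The Python sketch below fits a straight line through those two points, assuming the cost of a batch is a fixed part plus a per-image part; this linear model, the function name `fit_linear_cost`, and the estimate for processing four images one by one are illustrative assumptions, not measurements from the paper.

```python
def fit_linear_cost(n1, t1, n2, t2):
    """Fit t(n) = fixed + per_image * n through two (images, time) points."""
    per_image = (t2 - t1) / (n2 - n1)
    fixed = t1 - per_image * n1
    return fixed, per_image


# Timings reported in the text for 1024 x 768 images (milliseconds).
fixed_ms, per_image_ms = fit_linear_cost(1, 42.0, 4, 51.0)

# Assumption: four images processed one by one each pay the full
# single-image cost, transfer overhead included.
separate_ms = 4 * 42.0
batched_ms = fixed_ms + 4 * per_image_ms
speedup = separate_ms / batched_ms

print(f"fixed cost      : {fixed_ms:.1f} ms")
print(f"cost per image  : {per_image_ms:.1f} ms")
print(f"4 separate runs : {separate_ms:.1f} ms")
print(f"1 batched run   : {batched_ms:.1f} ms ({speedup:.2f}x faster)")
```

Under this model most of the 42 ms for a single image is a fixed cost, which is consistent with the explanation that per-image exchanges between the CPU and GPU dominate the execution time.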

Comparative Analysis of Performance on Different Platforms.
Figure 11 shows the average execution time for the suggested solution (Hadoop + GPU) versus current approaches. For all image resolutions listed in Table 5, the suggested framework takes much less time to execute than the other current frameworks (HIPI, HIPI + GPU, Hadoop + GPU). The results of the proposed solution (Hadoop + GPU) show that using an effective workload allocation mechanism in heterogeneous systems reduces average execution time while improving overall application performance. Table 8 shows the significant difference in execution time between the suggested method (Hadoop + GPU) and the existing platforms (HIPI, HIPI + GPU, and Hadoop + GPU). The findings of the experiment show that the suggested efficient workload allocation policy is the main factor that significantly improves application performance and fully utilizes the available resources.

Conclusion and Future Work
The goal of this paper is to introduce a novel programming framework utilizing the Hadoop MapReduce programming model and graphics processing units (GPUs). The suggested technique offers the following advantages over existing approaches for image processing applications on heterogeneous clusters.
First, a new method for partitioning data into optimal split sizes ensures locality for computations by ensuring that images within a given split cannot exceed the boundaries of the split. Second, to maximize resource efficiency and minimize data transfer, splits are distributed across nodes and within nodes according to their computational capabilities. Third, a cluster of commodity computer systems can readily manage huge amounts of image data, so expensive supercomputers or specialist vector machines need not be acquired. Future work will focus on developing a split size that can easily support varied image sizes and divide them among nodes as well as inside each node between the CPU and GPU. Real-time image processing refers to the completion of certain activities within a set period. In certain image processing applications, a stream of images is created that must be processed within a certain amount of time so that no image misses its deadline. In a future study, the suggested technique will be used in heterogeneous systems to process such a stream of images within the stated timeframe.

Figure 1: A generic diagram of the heterogeneous Hadoop cluster.

Figure 7: Average execution time (milliseconds) of CPU and GPU using a single image.

Figure 8: Variations in execution time of a single image on CPU.

Figure 9: Variations in execution time of a single image on GPU.

Figure 10: Average execution time (milliseconds) of an increasing number of images on a GPU.
In the input split size calculation, $I$ denotes the ideal input split size, $D$ the default split size, $S$ the image size, and $n_o$ the maximum number of images that can be accommodated by the ideal input split. $T_i$ denotes the total number of images in the dataset, and $S_n$ the number of splits with an equal distribution.
To compute $n_o$, divide $D$ by $S$ and apply the floor function to discard the fractional part. To get $I$, multiply $S$ by $n_o$: $$n_o = \lfloor D / S \rfloor, \qquad I = n_o \cdot S.$$
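The calculation can be checked with a short Python sketch. The values used (a 128 MB default split, 5 MB images, 1000 images) are hypothetical, and the formula for the number of splits, the ceiling of the total number of images divided by the images per split, is an assumption that the text does not state explicitly.

```python
def ideal_split_size(default_split_size, image_size):
    """Return (images per split, ideal split size) for equally sized images.

    images per split = floor(D / S); ideal split size I = images per split * S.
    """
    if image_size <= 0 or image_size > default_split_size:
        raise ValueError("image size must be positive and fit in one split")
    images_per_split = default_split_size // image_size
    return images_per_split, images_per_split * image_size


def number_of_splits(total_images, images_per_split):
    """Assumed rule: enough splits to hold every image (ceiling division)."""
    return -(-total_images // images_per_split)


# Hypothetical values in megabytes: D = 128, S = 5.
images_per_split, ideal_size = ideal_split_size(128, 5)
splits_needed = number_of_splits(1000, images_per_split)
```

Here 25 whole images fit in one split, so the ideal split size is 125 MB rather than 128 MB, and the 3 MB that would otherwise hold a fragment of a 26th image is left out of the split.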

Table 2: The ImageMapper class.
Class: public static class ImageMapper extends Mapper<BundleHeader, ImageBundle, Text, IntWritable>
Methods:
protected void map(BundleHeader key, ImageBundle value, Context context): defines the mapper that consumes the split and processes it.

Table 3: Variations in execution time (milliseconds) for CPU and GPU for edge detection.

Table 4: Average execution time (milliseconds) for CPU and GPU for edge detection.

Table 6: Average execution time (milliseconds) while scaling the number of images on a GPU.

Table 7: Increase in execution time of an increasing number of images on a GPU.