MapReduce Based Parallel Neural Networks in Enabling Large Scale Machine Learning

Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. However, ANNs are notably slow in computation, especially when the size of the data is large. Big data has recently gained momentum in both industry and academia. To fulfill the potential of ANNs for big data applications, the computation process must be sped up. For this purpose, this paper parallelizes neural networks based on MapReduce, which has become a major computing model for data intensive applications. Three data intensive scenarios are considered in the parallelization process: the volume of classification data, the size of the training data, and the number of neurons in the neural network. The performance of the parallelized neural networks is evaluated on an experimental MapReduce computer cluster in terms of classification accuracy and computation efficiency.


Introduction
Recently, big data has gained momentum in both industry and academia. Many organizations are continuously collecting massive datasets from various sources such as the World Wide Web, sensor networks, and social networks. In [1], big data is defined as a term that encompasses the use of techniques to capture, process, analyze, and visualize potentially large datasets in a reasonable time frame not accessible to standard IT technologies. Basically, big data is characterized by three Vs [2]:
(i) Volume: the sheer amount of data generated.
(ii) Velocity: the rate at which the data is being generated.
(iii) Variety: the heterogeneity of data sources.
Artificial neural networks (ANNs) have been widely used in pattern recognition and classification applications. The back-propagation neural network (BPNN), the most popular type of ANN, can approximate any continuous nonlinear function to arbitrary precision given a sufficient number of neurons [3]. Normally, BPNN employs the back-propagation algorithm for training, which requires a significant amount of time when the size of the training data is large [4]. To fulfill the potential of neural networks in big data applications, the computation process must be sped up with parallel computing techniques such as the Message Passing Interface (MPI) [5,6]. In [7], Long and Gupta presented a scalable parallel artificial neural network using MPI for parallelization. It is worth noting that MPI was designed for data intensive applications with high performance requirements. MPI provides little support for fault tolerance. If any fault happens, an MPI computation has to be restarted from the beginning. As a result, MPI is not suitable for big data applications, which would normally run for many hours during which some faults might happen.
This paper presents a MapReduce based parallel back-propagation neural network (MRBPNN). MapReduce has become a de facto standard computing model in support of big data applications [8,9]. MapReduce provides a reliable, fault-tolerant, scalable, and resilient computing framework for storing and processing massive datasets. MapReduce scales well with ever increasing sizes of datasets due to its use of hash keys for data processing and the strategy of moving computation to the closest data nodes. In MapReduce, there are mainly two functions, the Map function (mapper) and the Reduce function (reducer). Basically, a mapper is responsible for the actual data processing and generates intermediate results in the form of ⟨key, value⟩ pairs. A reducer collects the output results from multiple mappers and performs secondary processing, including sorting and merging the intermediate results based on the key values. Finally, the Reduce function generates the computation results.
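The mapper and reducer roles described above can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the toy records are assumptions made for illustration, and the summing reducer is just one possible merge operation.

```python
from collections import defaultdict

def mapper(chunk):
    """Process one data chunk and emit intermediate <key, value> pairs."""
    for key, value in chunk:
        yield key, value

def shuffle(pairs):
    """Group intermediate values by key and sort by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reducer(key, values):
    """Merge all values that share a key into one final result (here: a sum)."""
    return key, sum(values)

def run_job(chunks):
    """Run every mapper, group the intermediate pairs, then reduce each group."""
    intermediate = [pair for chunk in chunks for pair in mapper(chunk)]
    return dict(reducer(k, v) for k, v in shuffle(intermediate).items())

# Two chunks stand in for the input of two mappers (hypothetical toy data).
chunks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
result = run_job(chunks)
print(result)
```

In a real cluster the mappers run on different data nodes and the grouping step is performed by the framework; the sketch only shows how the key ties a mapper's output to the reducer that merges it.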
We present three MRBPNNs (i.e., MRBPNN 1, MRBPNN 2, and MRBPNN 3) to deal with different data intensive scenarios. MRBPNN 1 deals with a scenario in which the dataset to be classified is large. The input dataset is segmented into a number of data chunks which are processed by mappers in parallel. In this scenario, each mapper builds the same BPNN classifier using the same set of training data. MRBPNN 2 focuses on a scenario in which the volume of the training data is large. In this case, the training data is segmented into data chunks which are processed by mappers in parallel. Each mapper still builds the same BPNN but uses only a portion of the training dataset to train the BPNN. To maintain a high accuracy in classification, we employ a bagging based ensemble technique [10] in MRBPNN 2. MRBPNN 3 targets a scenario in which the number of neurons in a BPNN is large. In this case, MRBPNN 3 fully parallelizes and distributes the BPNN among the mappers in such a way that each mapper employs a portion of the neurons for training.
The rest of the paper is organized as follows. Section 2 gives a review on the related work. Section 3 presents the designs and implementations of the three parallel BPNNs using the MapReduce model. Section 4 evaluates the performance of the parallel BPNNs and analyzes the experimental results. Section 5 concludes the paper.

Related Work
ANNs have been widely applied in various pattern recognition and classification applications. For example, Jiang et al. [11] employed a back-propagation neural network to classify high resolution remote sensing images to recognize roads and roofs in the images. Khoa et al. [12] proposed a method to forecast the stock price using BPNN.
Traditionally, ANNs are employed to deal with a small volume of data. With the emergence of big data, ANNs have become computationally intensive for data intensive applications which limits their wide applications. Rizwan et al. [13] employed a neural network on global solar energy estimation. They considered the research as a big task, as traditional approaches are based on extreme simplicity of the parameterizations. A neural network was designed which contains a large number of neurons and layers for complex function approximation and data processing. The authors reported that in this case the training time will be severely affected. Wang et al. [14] pointed out that currently large scale neural networks are one of the mainstream tools for big data analytics. The challenge in processing big data with large scale neural networks includes two phases which are the training phase and the operation phase. To speed up the computations of neural networks, there are some efforts that try to improve the selection of initial weights [15] or control the learning parameters [16] of neural networks. Recently, researchers have started utilizing parallel and distributed computing technologies such as cloud computing to solve the computation bottleneck of a large neural network [17][18][19]. Yuan and Yu [20] employed cloud computing mainly for exchange of privacy data in a BPNN implementation in processing ciphered text classification tasks. However, cloud computing as a computing paradigm simply offers infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). It is worth noting that cloud computing still needs big data processing models such as the MapReduce model to deal with data intensive applications. Gu et al. [4] presented a parallel neural network using in-memory data processing techniques to speed up the computation of the neural network but without considering the accuracy aspect of the implemented parallel neural network. 
In this work, the training data is simply segmented into data chunks which are processed in parallel. Liu et al. [21] presented a MapReduce based parallel BPNN in processing a large set of mobile data. This work further employs AdaBoosting to accommodate the loss of accuracy of the parallelized neural network. However, the computationally intensive issue may exist not only at the training phase but also at the classification phase. In addition, AdaBoosting is a popular sampling technique; it may enlarge the weights of wrongly classified instances which would deteriorate the algorithm accuracy.

Parallelizing Neural Networks
This section presents the design details of the parallelized MRBPNN 1, MRBPNN 2, and MRBPNN 3. First, a brief review of BPNN is introduced.

Back-Propagation Neural Network.
The back-propagation neural network is a multilayer feed forward network which is trained using an error back-propagation mechanism. It has become one of the most widely used neural networks. BPNN can perform a large volume of input-output mappings without knowing their exact mathematical equations. This benefits from the gradient-descent feature of its back-propagation mechanism. During the error propagation, BPNN keeps tuning the parameters of the network until it adapts to all input instances. A typical BPNN is shown in Figure 1, which consists of an arbitrary number of inputs and outputs.
Generally speaking, a BPNN can have multiple network layers. However, it has been widely accepted that a three-layer BPNN is enough to fit the mathematical equations which approximate the mapping relationships between inputs and outputs [3]. Therefore, the topology of a BPNN usually contains three layers: an input layer, one hidden layer, and an output layer. The number of inputs in the input layer is mainly determined by the number of elements in an input eigenvector; for instance, let $s$ denote an input instance:

$$s = \{a_1, a_2, a_3, \ldots, a_n\}. \tag{1}$$

Then, the number of inputs is $n$. Similarly, the number of neurons in the output layer is determined by the number of classifications, and the number of neurons in the hidden layer is determined by users. Every input of a neuron has a weight $w_{ij}$, where $i$ and $j$ represent the source and destination of the input. Each neuron also maintains an optional parameter $\theta_j$, which is a bias for varying the activity of the $j$th neuron in a layer. Therefore, let $o_i$ denote the output from a previous neuron and let $o_j$ denote the output of this layer; the input $I_j$ of neurons located in both the hidden and output layers can be represented by

$$I_j = \sum_i w_{ij} o_i + \theta_j. \tag{2}$$

The output of a neuron is usually computed by the sigmoid function, so the output $o_j$ can be computed by

$$o_j = \frac{1}{1 + e^{-I_j}}. \tag{3}$$

After the feed forward process is completed, the back-propagation process starts. Let $\mathrm{Err}_j$ represent the error-sensitivity and let $t_j$ represent the desirable output of neuron $j$ in the output layer; thus,

$$\mathrm{Err}_j = o_j (1 - o_j)(t_j - o_j). \tag{4}$$

Let $\mathrm{Err}_k$ represent the error-sensitivity of one neuron in the last layer and let $w_{jk}$ represent its weight; thus, $\mathrm{Err}_j$ of a neuron in the other layers can be computed using

$$\mathrm{Err}_j = o_j (1 - o_j) \sum_k \mathrm{Err}_k w_{jk}. \tag{5}$$

After $\mathrm{Err}_j$ is computed, the weights and biases of each neuron are tuned in the back-propagation process using

$$w_{ij} = w_{ij} + \eta\, \mathrm{Err}_j\, o_i, \qquad \theta_j = \theta_j + \eta\, \mathrm{Err}_j, \tag{6}$$

where $\eta$ is the learning rate. After the first input vector finishes tuning the network, the next round starts for the following input vectors.
The input keeps training the network until (7) is satisfied for a single output or (8) is satisfied for multiple outputs:

$$\min\left(E\left[e^2\right]\right) = \min\left(E\left[(t - o)^2\right]\right), \tag{7}$$

$$\min\left(E\left[e^T e\right]\right) = \min\left(E\left[(t - o)^T (t - o)\right]\right). \tag{8}$$
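The feed forward and back-propagation steps described above can be sketched for a three-layer network as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `BPNN`, the learning rate of 0.5, the fixed random seed, and the single toy training instance are all hypothetical, and the standard sigmoid back-propagation rules are used for the weighted input, output, error-sensitivity, and update steps.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BPNN:
    """Three-layer network: inputs -> one hidden layer -> outputs."""

    def __init__(self, n_in, n_hidden, n_out, rate=0.5, seed=0):
        rng = random.Random(seed)
        init = lambda: rng.uniform(-1.0, 1.0)
        # w[j][i] is the weight from source i to destination neuron j.
        self.w_h = [[init() for _ in range(n_in)] for _ in range(n_hidden)]
        self.b_h = [init() for _ in range(n_hidden)]
        self.w_o = [[init() for _ in range(n_hidden)] for _ in range(n_out)]
        self.b_o = [init() for _ in range(n_out)]
        self.rate = rate  # learning rate (assumed value)

    @staticmethod
    def _layer(weights, biases, inputs):
        # Weighted sum plus bias, then sigmoid, for every neuron in the layer.
        return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
                for row, b in zip(weights, biases)]

    def forward(self, inputs):
        hidden = self._layer(self.w_h, self.b_h, inputs)
        return hidden, self._layer(self.w_o, self.b_o, hidden)

    def train_step(self, inputs, target):
        hidden, out = self.forward(inputs)
        # Error-sensitivity of the output neurons.
        err_o = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
        # Error-sensitivity of the hidden neurons, from the layer after them.
        err_h = [h * (1 - h) * sum(err_o[k] * self.w_o[k][j]
                                   for k in range(len(err_o)))
                 for j, h in enumerate(hidden)]
        # Tune weights and biases with the error-sensitivities.
        for k, e in enumerate(err_o):
            self.w_o[k] = [w + self.rate * e * h for w, h in zip(self.w_o[k], hidden)]
            self.b_o[k] += self.rate * e
        for j, e in enumerate(err_h):
            self.w_h[j] = [w + self.rate * e * x for w, x in zip(self.w_h[j], inputs)]
            self.b_h[j] += self.rate * e
        # Squared error measured before this update.
        return sum((t - o) ** 2 for o, t in zip(out, target))

net = BPNN(n_in=2, n_hidden=3, n_out=1)
first = net.train_step([1.0, 0.0], [1.0])
for _ in range(200):
    last = net.train_step([1.0, 0.0], [1.0])
print(first, last)
```

Repeating `train_step` on one instance only demonstrates that the squared error shrinks; a real training loop would cycle through all training instances until the stopping criterion is met.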

MapReduce Computing Model.
MapReduce has become the de facto standard computing model for data intensive applications running on a cluster of commodity computers. Popular implementations of the MapReduce computing model include Mars [22], Phoenix [23], and the Hadoop framework [24,25]. The Hadoop framework has been widely taken up by the community because it is open source. Hadoop uses the Hadoop Distributed File System (HDFS) for data management. A Hadoop cluster has one name node (Namenode) and a number of data nodes (Datanodes) for running jobs. The name node manages the metadata of the cluster whilst a data node is the actual processing node. The Map functions (mappers) and Reduce functions (reducers) run on the data nodes. When a job is submitted to a Hadoop cluster, the input data is divided into small chunks of an equal size and saved in HDFS. In terms of data integrity, each data chunk can have one or more replicas according to the cluster configuration. In a Hadoop cluster, mappers copy and read data from either remote or local nodes based on data locality. The final output results are sorted, merged, and generated by reducers in HDFS.

The Design of MRBPNN 1.
MRBPNN 1 targets the scenario in which the dataset to be classified is large. Let $S$ denote a dataset and let $n$ denote the length of an instance $s_k \in S$; $n$ also determines the number of inputs of a neural network. The inputs are encapsulated in the format ⟨instance$_k$, target$_k$, type⟩, where
(i) instance$_k$ represents $s_k$, which is the input of a neural network;
(ii) target$_k$ represents the desirable output if instance$_k$ is a training instance;
(iii) the type field has two values, "train" and "test," which mark the type of instance$_k$; if the "test" value is set, the target field should be left empty.
Files which contain instances are saved into HDFS initially. Each file contains all the training instances and a portion of the testing instances. Therefore, the number of files determines the number of mappers to be used. The file content is the input of MRBPNN 1.
When the algorithm starts, each mapper initializes a neural network, so the cluster holds as many neural networks as there are mappers. Moreover, all the neural networks have exactly the same structure and parameters. Each mapper reads data in the form of ⟨instance$_k$, target$_k$, type⟩ from a file and parses the data records. If the value of the type field is "train," instance$_k$ is fed into the input layer of the neural network. The network computes the output of each layer using (2) and (3) until the output layer generates an output, which indicates the completion of the feed forward process. The neural network in each mapper then starts the back-propagation process, computing and updating the weights and biases of its neurons using (4) to (6). The neural network then takes instance$_{k+1}$ as input. The feed forward and back-propagation processes are repeated until all the instances labeled as "train" are processed and the error criterion is satisfied. Each mapper then classifies the instances labeled as "test" by running the feed forward process. As each mapper only classifies a portion of the entire testing dataset, the efficiency is improved. At last, each mapper emits its intermediate output in the form of ⟨instance$_k$, $o_m$⟩, where instance$_k$ is the key and $o_m$ represents the output of the $m$th mapper. One reducer collects and merges all the outputs of the mappers. Finally, the reducer outputs ⟨instance$_k$, $r_k$⟩ into HDFS, where $r_k$ represents the final classification result of instance$_k$. Figure 2 shows the architecture of MRBPNN 1 and Algorithm 1 shows the pseudocode.
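The data flow of MRBPNN 1 described above can be sketched in a single process as follows. To keep the example short, a nearest-centroid classifier on one-dimensional instances stands in for the BPNN; this stand-in, the function names, and the toy records are assumptions for illustration only and are not part of the original design.

```python
def train(training):
    """Stand-in for BPNN training (assumption): per-class mean of 1-D instances."""
    sums = {}
    for x, label in training:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    centroids = {label: total / count for label, (total, count) in sums.items()}
    # The returned classifier picks the class whose centroid is closest.
    return lambda x: min(centroids, key=lambda label: abs(x - centroids[label]))

def mapper(records):
    """One mapper: train on every "train" record, classify its own "test" records."""
    training = [(inst, target) for inst, target, kind in records if kind == "train"]
    classify = train(training)
    for inst, _target, kind in records:
        if kind == "test":
            yield inst, classify(inst)  # intermediate <instance, output> pair

def reducer(mapper_outputs):
    """Merge the <instance, output> pairs from all mappers into one result."""
    merged = {}
    for pairs in mapper_outputs:
        merged.update(pairs)
    return dict(sorted(merged.items()))

train_set = [(0.1, "low", "train"), (0.2, "low", "train"),
             (0.8, "high", "train"), (0.9, "high", "train")]
test_set = [0.15, 0.3, 0.7, 0.95]
# Each file holds all training records plus a slice of the test records.
files = [train_set + [(x, None, "test") for x in test_set[i::2]] for i in range(2)]
result = reducer(mapper(f) for f in files)
print(result)
```

Because every mapper sees the same training records, all mappers build the same classifier; only the test records differ between files, which is where the parallel speed-up comes from.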

The Design of MRBPNN 2.
MRBPNN 2 focuses on the scenario in which a BPNN has a large volume of training data. Consider a training dataset $S$ with a number of instances. As shown in Figure 3, MRBPNN 2 divides $S$ into $M$ data chunks, each of which, $S_i$, is processed by a mapper for training:

$$S = \bigcup_{i=1}^{M} S_i. \tag{9}$$

Each mapper in the Hadoop cluster maintains a BPNN, and each $S_i$ is considered as the input training data for the neural network maintained in mapper$_i$. As a result, each BPNN in a mapper produces a classifier based on the trained parameters:

$$(\text{mapper}_i, \text{BPNN}_i, S_i) \rightarrow \text{classifier}_i. \tag{10}$$

To reduce the computation overhead, each classifier is trained with a part of the original training dataset. However, a critical issue is that the classification accuracy of a mapper will be significantly degraded when it uses only a portion of the training data. To solve this issue, MRBPNN 2 employs an ensemble technique to maintain the classification accuracy by combining a number of weak learners to create a strong learner.
Bootstrapping.
Training diverse classifiers from a single training dataset has been proven to be simple compared with the case of finding a strong learner [26]. A number of techniques exist for this purpose. A widely used technique is to resample the training dataset based on bootstrap aggregating, which combines bootstrapping and majority voting. This can reduce the variance of misclassification errors and hence increases the accuracy of the classifications.
As mentioned in [26], balanced bootstrapping can reduce the variance when combining classifiers. Balanced bootstrapping ensures that each training instance appears equally often in the bootstrap samples, although an individual bootstrap sample does not necessarily contain all the training instances. The most efficient way of creating balanced bootstrap samples is to construct a string of the instances $X_1, X_2, X_3, \ldots, X_N$ repeated $B$ times so that a sequence $Y_1, Y_2, Y_3, \ldots, Y_{BN}$ is obtained. A random permutation $p$ of the integers from 1 to $BN$ is taken. Therefore, the first bootstrapping sample can be created from $Y_{p(1)}, Y_{p(2)}, \ldots, Y_{p(N)}$, and each subsequent sample is taken from the next $N$ entries of the permuted sequence.
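The balanced bootstrapping procedure described above can be sketched in a few lines. The function name `balanced_bootstrap`, its parameters, and the toy instance labels are hypothetical; shuffling the repeated list is used here as an equivalent of drawing a random permutation of its indices.

```python
import random

def balanced_bootstrap(instances, n_samples, seed=None):
    """Return n_samples bootstrap samples in which every instance appears
    exactly n_samples times overall."""
    rng = random.Random(seed)
    size = len(instances)
    pool = list(instances) * n_samples  # the instances repeated n_samples times
    rng.shuffle(pool)                   # random permutation of the repeated sequence
    # Cut the permuted sequence into consecutive samples of the original size.
    return [pool[i * size:(i + 1) * size] for i in range(n_samples)]

samples = balanced_bootstrap(["x1", "x2", "x3", "x4"], n_samples=3, seed=42)
for sample in samples:
    print(sample)
```

An individual sample may contain duplicates and miss some instances, but across all samples each instance is used the same number of times, which is the balancing property the text refers to.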

Majority Voting.
This type of ensemble classifier performs classification based on the majority votes of the base classifiers [26]. Let us define the prediction of the $i$th classifier as $P_{i,j} \in \{1, 0\}$, $i = 1, \ldots, I$ and $j = 1, \ldots, c$, where $I$ is the number of classifiers and $c$ is the number of classes. If the $i$th classifier chooses class $j$, then $P_{i,j} = 1$; otherwise, $P_{i,j} = 0$. Then, the ensemble prediction for class $k$ is computed using

$$\sum_{i=1}^{I} P_{i,k} = \max_{j = 1, \ldots, c} \sum_{i=1}^{I} P_{i,j}. \tag{11}$$

MRBPNN 2 trains on bootstrapped subsets of the training data, where $S_i$ represents the $i$th subset, which belongs to the entire dataset $S$, and $M$ represents the total number of subsets. Each $S_i$ is saved in one file in HDFS. Each instance $s_k = \{a_1, a_2, a_3, \ldots, a_n\}$, $s_k \in S_i$, is defined in the format of ⟨instance$_k$, target$_k$, type⟩, where
(i) instance$_k$ represents one bootstrapped instance $s_k$, which is the input of the neural network;
(ii) $n$ represents the number of inputs of the neural network;
(iii) target$_k$ represents the desirable output if instance$_k$ is a training instance;
(iv) the type field has two values, "train" and "test," which mark the type of instance$_k$; if the "test" value is set, the target field should be left empty.
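The majority vote defined in this section can be sketched as a reducer-style function that groups the mappers' predictions by instance and keeps the most frequent class. The function names and the toy predictions are hypothetical, and the tie-breaking rule (smallest class label wins) is an assumption, since the text does not specify one.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class chosen by the largest number of base classifiers."""
    votes = Counter(predictions)
    top = max(votes.values())
    # Tie-break on the smallest class label (assumption, not from the source).
    return min(label for label, count in votes.items() if count == top)

def reduce_votes(mapper_outputs):
    """Group <instance, class> pairs by instance and vote within each group."""
    by_instance = {}
    for instance, label in mapper_outputs:
        by_instance.setdefault(instance, []).append(label)
    return {inst: majority_vote(labels) for inst, labels in by_instance.items()}

# Three classifiers vote on two instances (hypothetical outputs).
pairs = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1)]
voted = reduce_votes(pairs)
print(voted)
```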
When MRBPNN 2 starts, each mapper constructs one BPNN and initializes the weights and biases of its neurons with random values between −1 and 1. A mapper then reads one record in the form of ⟨instance$_k$, target$_k$, type⟩ from the input file.
The mapper first parses the data and retrieves the type of the instance. If the type value is "train," the instance is fed into the input layer. Each neuron in each layer then computes its output using (2) and (3) until the output layer generates an output, which indicates the completion of the feed forward process. Each mapper then starts the back-propagation process and computes and updates the weights and biases of its neurons using (4) to (6). The training process continues until all the instances marked as "train" are processed and the error criterion is satisfied. All the mappers then run the feed forward process to classify the testing dataset. In this case, each neural network in a mapper generates the classification result of an instance at the output layer. Each mapper generates an intermediate output in the form of ⟨instance$_k$, $o_m$⟩, where instance$_k$ is the key and $o_m$ represents the output of the $m$th mapper.
Finally, a reducer collects the outputs of all the mappers. The outputs with the same key are merged together. The reducer runs majority voting using (11) and outputs the result of instance$_k$ into HDFS in the form of ⟨instance$_k$, $r_k$⟩, where $r_k$ represents the voted classification result of instance$_k$. Figure 3 shows the algorithm architecture and Algorithm 2 presents the pseudocode of MRBPNN 2.

The Design of MRBPNN 3.
MRBPNN 3 aims at the scenario in which a BPNN has a large number of neurons. The algorithm enables an entire MapReduce cluster to maintain one neural network across it. Therefore, each mapper holds one or several neurons.
The algorithm iterates over the $l$ layers of the network. MRBPNN 3 employs $l - 1$ MapReduce jobs to implement the iterations. The feed forward process runs in $l - 1$ rounds whilst the back-propagation process occurs only in the last round. A data format in the form of ⟨index$_r$, instance$_k$, $w_{ij}$, $\theta_j$, target$_k$, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$⟩ has been designed to guarantee the data passing between Map and Reduce operations, where
(i) index$_r$ represents the $r$th reducer;
(ii) instance$_k$ represents the $k$th training or testing instance of the dataset; one instance is in the form of instance$_k$ = $\{a_1, a_2, \ldots, a_n\}$, where $n$ is the length of the instance;
(iii) $w_{ij}$ represents a set of weights of an input layer, whilst $\theta_j$ represents the biases of the neurons in the first hidden layer;
(iv) target$_k$ represents the encoded desirable output of a training instance instance$_k$;
(v) the list $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$ represents the weights and biases for the next layers; it can be extended based on the layers of the network; for a standard three-layer neural network, this option becomes $\{w^2_{ij}, \theta^2_j\}$.
Before MRBPNN 3 starts, each instance and the information defined by the data format are saved in one file in HDFS. The number of layers is determined by the length of the $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$ field. The number of neurons in the next layer is determined by the number of files in the input folder. Generally, different from MRBPNN 1 and MRBPNN 2, MRBPNN 3 does not initialize an explicit neural network; instead, it maintains the network parameters based on the data defined in the data format.
When MRBPNN 3 starts, each mapper initially reads one record from HDFS. It then computes the output of a neuron using (2) and (3). The output is generated by a mapper, which labels index$_r$ as a key and the neuron's output as a value in the form of ⟨index$_r$, $o_j$, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$, target$_k$⟩, where $o_j$ represents the neuron's output.
Parameter index$_r$ guarantees that the $r$th reducer collects the output, which maintains the neural network structure. It should be mentioned that if the record is the first one processed by MRBPNN 3, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$ will also be initialized with random values between −1 and 1 by the mappers. The $r$th reducer collects the results from the mappers in the form of ⟨index$_r$, $o_j$, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$, target$_k$⟩. Each reducer generates one output. The index$_r$ of the reducer output explicitly tells the $r$th mapper to start processing this output file. Therefore, the number of neurons in the next layer can be determined by the number of reducer output files, which are the input data for the next layer neurons. Subsequently, mappers start processing their corresponding inputs by computing (2) and (3) using $w^2_{ij}$ and $\theta^2_j$.
The above steps keep looping until reaching the last round. The processing of this last round consists of two steps. In the first step, mappers also process ⟨index$_r$, $o_j$, $\{w^{l-1}_{ij}, \theta^{l-1}_j\}$, target$_k$⟩, compute the neurons' outputs, and generate results in the form of ⟨$o_j$, target$_k$⟩. One reducer collects the output results of all the mappers in the form of ⟨$o_1, o_2, o_3, \ldots, o_j$, target$_k$⟩. In the second step, the reducer executes the back-propagation process. The reducer computes new weights and biases for each layer using (4) to (6). MRBPNN 3 retrieves the previous outputs, weights, and biases from the input files of the mappers, and then it writes the updated weights and biases $w_{ij}$, $\theta_j$, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$ into the initial input file in the form of ⟨index$_r$, instance$_k$, $w_{ij}$, $\theta_j$, target$_k$, $\{w^2_{ij}, \theta^2_j, \ldots, w^{l-1}_{ij}, \theta^{l-1}_j\}$⟩. The reducer then reads the next instance in the form of ⟨instance$_{k+1}$, target$_{k+1}$⟩, for which the fields instance$_k$ and target$_k$ in the input file are replaced by instance$_{k+1}$ and target$_{k+1}$. The training process continues until all the instances are processed and the error criterion is satisfied.
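The round-by-round feed forward of MRBPNN 3 can be illustrated with a single-process sketch in which each simulated mapper computes one neuron and a reducer assembles the layer's outputs by index. The function names, the `layers` structure, and the toy weights are assumptions for illustration; the back-propagation step of the last round is omitted.

```python
import math

def neuron_mapper(index, inputs, weights, bias):
    """One mapper = one neuron: weighted sum plus bias, then sigmoid.
    Emits <index, output> so that outputs can be put back in neuron order."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return index, 1.0 / (1.0 + math.exp(-net))

def layer_reducer(pairs):
    """Collect neuron outputs by index to form the next layer's input vector."""
    return [output for _index, output in sorted(pairs)]

def feed_forward(instance, layers):
    """Run one simulated MapReduce job per entry in `layers`;
    each entry is (weights, biases) with one weight row per neuron."""
    signal = instance
    for weights, biases in layers:
        pairs = [neuron_mapper(j, signal, w, b)
                 for j, (w, b) in enumerate(zip(weights, biases))]
        signal = layer_reducer(pairs)
    return signal

# A three-layer network needs two jobs: hidden layer, then output layer.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer: 2 neurons
    ([[1.0, -1.0]], [0.0]),                  # output layer: 1 neuron
]
outputs = feed_forward([0.3, 0.7], layers)
print(outputs)
```

Each pass through the loop corresponds to one MapReduce job, which is why the job start-up cost is paid once per layer and per training instance in this design.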

Performance Evaluation
We have implemented the three parallel BPNNs using Hadoop, an open source implementation framework of the MapReduce computing model. An experimental Hadoop cluster was built to evaluate the performance of the algorithms. The cluster consisted of 5 computers in which 4 nodes are Datanodes and the remaining one is Namenode. The cluster details are listed in Table 1.
Two testing datasets were prepared for evaluations. The first dataset is a synthetic dataset. The second is the Iris dataset, which is a published machine learning benchmark dataset [27]. Table 2 shows the details of the two datasets.
[Algorithm 3: MRBPNN 3.]
We implemented a three-layer neural network with 16 neurons in the hidden layer. The Hadoop cluster was configured with 16 mappers and 16 reducers. The number of instances was varied from 10 to 1000 for evaluating the precision of the algorithms. The size of the datasets was varied from 1 MB to 1 GB for evaluating the computation efficiency of the algorithms. Each experiment was executed five times and the final result was an average. The precision $p$ is computed using

$$p = \frac{r}{r + w} \times 100\%,$$

where $r$ represents the number of correctly recognized instances, $w$ represents the number of wrongly recognized instances, and $p$ represents the precision.
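The precision measure defined above is a simple ratio and can be computed as in the following sketch. The function name and the counts in the usage line are illustrative assumptions, not figures taken from the experiments.

```python
def precision(correct, wrong):
    """Percentage of correctly recognized instances among all classified ones."""
    total = correct + wrong
    if total == 0:
        raise ValueError("no instances were classified")
    return 100.0 * correct / total

# Illustrative counts only: 39 correct and 1 wrong classification.
print(precision(39, 1))
```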

Classification Precision.
The classification precision of MRBPNN 1 was evaluated using a varied number of training instances. The maximum number of the training instances was 1000 whilst the maximum number of the testing instances was also 1000. The large number of instances is based on the data duplication. Figure 5 shows the precision results of MRBPNN 1 in classification using 10 mappers. It can be observed that the precision keeps increasing with an increase in the number of training instances. Finally, the precision of MRBPNN 1 on the synthetic dataset reaches 100% while the precision on the Iris dataset reaches 97.5%. In this test, the behavior of the parallel MRBPNN 1 is quite similar to that of the standalone BPNN. The reason is that MRBPNN 1 does not distribute the BPNN among the Hadoop nodes; instead, it runs on Hadoop to distribute the data.
To evaluate MRBPNN 2, we designed 1000 training instances and 1000 testing instances using data duplication. The mappers were trained on subsets of the training instances and produced the classification results of the 1000 testing instances based on bootstrapping and majority voting. MRBPNN 2 employed 10 mappers, each of which read a number of training instances varying from 10 to 1000. Figure 6 presents the precision results of MRBPNN 2 on the two testing datasets. It also shows that, as the number of training instances in each sub-neural network increases, the precision achieved by majority voting keeps increasing. The precision of MRBPNN 2 on the synthetic dataset reaches 100% whilst the precision on the Iris dataset reaches 97.5%, which is higher than that of MRBPNN 1.
MRBPNN 3 implements a fully parallel and distributed neural network using Hadoop to deal with a complex neural network with a large number of neurons. Figure 7 shows the performance of MRBPNN 3 using 16 mappers. The precision also increases along with the increasing number of training instances for both datasets. It also can be observed that the stability of the curve is quite similar to that of MRBPNN 1. Both curves have more fluctuations than that of MRBPNN 2. Figure 8 compares the overall precision of the three parallel BPNNs. MRBPNN 1 and MRBPNN 3 perform similarly, whereas MRBPNN 2 performs the best using bootstrapping and majority voting. In addition, the precision of MRBPNN 2 in classification is more stable than that of both MRBPNN 1 and MRBPNN 3. Figure 9 presents the stability of the three algorithms on the synthetic dataset showing the precision of MRBPNN 2 in classification is highly stable compared with that of both MRBPNN 1 and MRBPNN 3.

Computation Efficiency.
A number of experiments were carried out in terms of computation efficiency using the synthetic dataset. The first experiment evaluated the efficiency of MRBPNN 1 using 16 mappers. The volume of data instances was varied from 1 MB to 1 GB. Figure 10 clearly shows that the parallel MRBPNN 1 significantly outperforms the standalone BPNN. The computation overhead of the standalone BPNN is low when the data size is less than 16 MB. However, the overhead of the standalone BPNN increases sharply with increasing data sizes, whereas MRBPNN 1 scales better. This is mainly because MRBPNN 1 distributes the testing data across the 4 data nodes in the Hadoop cluster, which run the classification in parallel. Figure 11 shows the computation efficiency of MRBPNN 2 using 16 mappers. It can be observed that when the data size is small, the standalone BPNN performs better. However, the computation overhead of the standalone BPNN increases rapidly when the data size is larger than 64 MB. Similar to MRBPNN 1, the parallel MRBPNN 2 scales with increasing data sizes using the Hadoop framework. Figure 12 shows the computation overhead of MRBPNN 3 using 16 mappers. MRBPNN 3 incurs a higher overhead than both MRBPNN 1 and MRBPNN 2. The reason is that both MRBPNN 1 and MRBPNN 2 run training and classification within one MapReduce job, which means mappers and reducers only need to start once. However, MRBPNN 3 contains a number of jobs, so the algorithm has to start mappers and reducers a number of times. This process incurs a large system overhead which affects its computation efficiency. Nevertheless, Figure 12 shows the feasibility of fully distributing a BPNN in dealing with a complex neural network with a large number of neurons.

Conclusion
In this paper, we have presented three parallel neural networks (MRBPNN 1, MRBPNN 2, and MRBPNN 3) based on the MapReduce computing model to deal with data intensive scenarios in terms of the size of the classification dataset, the size of the training dataset, and the number of neurons, respectively. Overall, experimental results have shown that the computation overhead can be significantly reduced using a number of computers in parallel. MRBPNN 3 shows the feasibility of fully distributing a BPNN in a computer cluster but incurs a high computation overhead due to the continuous starting and stopping of mappers and reducers in the Hadoop environment. One direction for future work is to research in-memory processing to further enhance the computation efficiency of MapReduce in dealing with data intensive tasks with many iterations.