Modern technological advancement is increasingly digitizing our lives, which has led to a rapid growth of data. Such multidimensional datasets are precious because of their potential for unearthing new knowledge and developing decision-making insights. Analyzing this huge amount of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. While the Hadoop framework is a popular platform for processing large datasets, a number of other computing infrastructures are available for use in various application domains. The primary focus of the study is how to classify major big data resource management systems in the context of the cloud computing environment. We identify some key features which characterize big data frameworks as well as their associated challenges and issues. We use various evaluation metrics from different aspects to identify usage scenarios of these platforms. The study produced some interesting findings which contradict the available literature on the Internet.
1. Introduction
We live in the information age, and an important measure of present times is the amount of data that is generated all around us. Data is becoming increasingly valuable. Enterprises aim to unlock data’s hidden potential and deliver competitive advantage [1]. Stratistics MRC projected that the data analytics and Hadoop market, which accounted for $8.48 billion in 2015, would reach $99.31 billion by 2022 [2]. The global big data market was estimated to jump from $14.87 billion in 2013 to $46.34 billion in 2018 [3]. Gartner has predicted that data will grow by 800 percent over the next five years, and that 80 percent of the data will be unstructured (e-mails, documents, audio, video, and social media content) and 20 percent will be structured (e-commerce transactions and contact information) [1].
Today’s largest scientific institution, CERN, produces over 200 PB of data per year in the Large Hadron Collider project (as of 2017). The amount of data generated on the Internet has already exceeded 2.5 exabytes per day. Every minute, 400 hours of video are uploaded to YouTube and 3.6 million Google searches are conducted worldwide; each day, more than 656 million tweets are shared on Twitter and more than 6.5 million pictures are shared on Instagram. When a dataset becomes so large that its storage and processing become challenging due to the constraints of existing tools and resources, the dataset is referred to as big data [4, 5]. It is the first part of the journey towards delivering decision-making insights. Instead of focusing on people, this process utilizes a much more powerful and evolving technology, given the latest breakthroughs in this field, to quickly analyze huge streams of data from a variety of sources and to produce one single stream of useful knowledge [6].
Big data applications might be viewed as the advancement of parallel computing, but with the important exception of the scale. The scale is the necessity arising from the nature of the target issues: data dimensions largely exceed conventional storage units, the level of parallelism needed to perform computation within a strict deadline is high, and obtaining final results requires the aggregation of large numbers of partial results. The scale factor, in this case, does not only have the same effect that it has in classical parallel computing, but it surges towards a dimension in which automated resource management and its exploitation are of significant value [7].
An important factor for success in big data analytical projects is the management of resources: these platforms use a substantial amount of virtualized hardware resources to optimize the tradeoff between costs and results. Managing such resources is a challenge, and the complexity is rooted in their architecture. The first level of complexity stems from the performance requirements of the computing nodes: typical big data applications utilize massively parallel computing resources, storage subsystems, and networking infrastructure because results are required within a certain time frame, or they lose their value over time. Heterogeneity is a technological need: evolvability, extensibility, and maintainability of the hardware layer imply that the system will be partially integrated, replaced, or extended by means of new parts, according to the availability on the market and the evolution of technology [7].
Another important consideration of modern applications is the massive amount of data that needs to be processed. Such data usually originate from different sets of devices (e.g., public web, business applications, satellites, or sensors) and procedures (e.g., case studies, observational studies, or simulations). Therefore, it is imperative to develop computational architectures with even better performance to support current and future application needs. Historically, this need for computational resources was met by high-performance computing (HPC) environments such as computer clusters, supercomputers, and grids. In traditional owner-centric HPC environments, internal resources are handled by a single administrative domain [19]. Cluster computing is the leading architecture for this environment. In distributed HPC environments, such as grid computing, virtual organizations manage the provisioning of resources, both internal and external, to meet application needs [20].
However, the paradigm shift towards cloud computing has been widely discussed in recent research [19, 21], which targets the execution of HPC workloads in cloud computing environments. Although organizations usually prefer to store their most sensitive data internally (on-premises), huge volumes of big data (owned by the enterprises or generated by third parties) may be stored externally; some of it may already be in a cloud. Retaining all data sources behind the firewall may result in a significant waste of resources. Analyzing the data where it resides, either internally or in a public cloud data center, makes more sense [1, 22].
Although cloud computing is an enabler of the growth of big data applications, common cloud computing solutions are rather different from big data applications. Typically, cloud computing solutions offer fine-grained, loosely coupled applications that serve large numbers of users operating independently from multiple locations, possibly on their own private, nonshared data. These applications involve a significant amount of interaction rather than being mainly batch-oriented, and they are generally fit to be relocated, with highly dynamic resource needs. Despite such differences, cloud computing and big data architectures share a number of common requirements, such as automated (or autonomic) fine-grained resource management and scaling-related issues [7].
As cloud computing begins to mature, a large number of enterprises are building efficient and agile cloud environments, and cloud providers continue to expand service offerings [1]. Microsoft’s cloud Hadoop offering includes Azure Marketplace, which runs Cloudera Enterprise, MapR, and Hortonworks Data Platform (HDP) in a virtual machine, and Azure Data Lake, which includes Azure HDInsight, Data Lake Analytics, and Data Lake Store as managed services. The platform offers rich productivity suites for database, data warehouse, cloud, spreadsheet, collaboration, business intelligence, OLAP, and development tools, delivering a growing Hadoop stack to Microsoft community. Amazon Web Services reigns among the leaders of cloud computing and big data solutions. Amazon EMR is available across 14 regions worldwide. AWS offers versions of Hadoop, Spark, Tez, and Presto that can work off data stored in Amazon S3 and Amazon Glacier. Cloud Dataproc is Google’s managed Hadoop and Spark cluster to use fully managed cloud services such as Google BigQuery and Bigtable. IBM differentiates BigInsights with end-to-end advanced analytics. IBM BigInsights runs on top of IBM’s SoftLayer cloud infrastructure and can be deployed on more than 30 global data centers. IBM is making significant investments in Spark, BigQuality, BigIntegrate, and IBM InfoSphere Big Match that run natively with YARN to handle the toughest Hadoop use cases [23].
In this paper, we give an overview of some of the most popular and widely used big data frameworks, in the context of the cloud computing environment, which are designed to cope with the above-mentioned resource management and scaling problems. The primary objective of the study is to classify different big data resource management systems. We use various evaluation metrics for popular big data frameworks from different aspects. We also identify some key features which characterize big data frameworks as well as their associated challenges and issues. We restricted our study selection criteria to empirical studies from existing literature with reported evidence on performance evaluation of big data resource management frameworks. To the best of our knowledge, thus far there has been no empirically based performance evaluation report on major resource management frameworks. We investigated the validity of existing research by performing a confirmatory study. For this purpose, the standard performance evaluation tests as well as custom load test cases were performed on an Amazon AWS cluster of 10+1 t2.2xlarge nodes. For experimentation and benchmarking, we followed the same process as outlined in our earlier study [24].
The study came up with some interesting findings which are in contradiction with the available literature on the Internet. The novelty of the study includes the categorization of cloud-based big data resource management frameworks according to their key features, comparative evaluation of the popular big data frameworks, and the best practices related to the use of big data frameworks in the cloud.
The inclusion and exclusion criteria for relevant research studies are as follows:
We selected only those resource management frameworks for which we found empirical evidence of being offered by various cloud providers.
Several vendors offer proprietary solutions for big data analysis which could be potential candidates for the comparative analysis conducted in this study. However, these frameworks were not selected, for two reasons. Firstly, most of these solutions are extensions of open-source solutions and hence exhibit identical performance results in most cases. Secondly, for empirical studies, researchers mostly prefer open-source solutions, as the documentation, usage scenarios, source code, and other relevant details are freely available. Hence, we selected open-source solutions for the performance evaluation.
We did not include frameworks which are now deprecated or discontinued, such as Apache S4, in favor of other resource management systems.
This paper is organized as follows. Section 2 reviews the popular resource management frameworks. The comparison of big data frameworks is presented in Section 3. Based on the comparative evaluation, we categorize these systems in Section 4. Related work is presented in Section 5 and, finally, we present conclusion and possible future directions in Section 6.
2. Big Data Resource Management Frameworks
Big data is offering new trends and opportunities to unearth operational insight towards data management. The most challenging issue for organizations is often that a massive amount of data needs to be processed at an optimal speed to synthesize relevant results. Analyzing such a huge amount of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. In many cases, big data is analyzed in batch mode. However, in many situations, we may need to react to the current state of data or analyze data that is in motion (data that is constantly coming in and needs to be processed immediately). These applications require a continuous stream of often unstructured data to be processed. Therefore, data is continuously analyzed and cached in memory before it is stored on secondary storage devices. Processing streams of data works by filtering in-memory tables of data across a cluster of servers. Any delay in the data analysis can seriously impact customer satisfaction or may result in project failure [25].
While the Hadoop framework is a popular platform for processing huge datasets in parallel batch mode using commodity computational resources, there are a number of other computing infrastructures that can be used in various application domains. The primary focus of this study is to investigate popular big data resource management frameworks which are commonly used in cloud computing environment. Most of the popular big data tools available for cloud computing platform, including the Hadoop ecosystem, are available under open-source licenses. One of the key appeals of Hadoop and other open-source solutions is the low total cost of ownership. While proprietary solutions have expensive license fees and may require more costly specialized hardware, these open-source solutions have no licensing fees and can run on industry-standard hardware [14]. Figure 1 demonstrates the classification of various styles of processing architectures of open-source big data resource management frameworks.
Figure 1: Classification of big data resource management frameworks.
In the following subsections, we discuss various open-source big data resource management frameworks that are widely used in conjunction with the cloud computing environment.
2.1. Hadoop
Hadoop [26] is a distributed programming and storage infrastructure based on the open-source implementation of the MapReduce model [27]. MapReduce is the first and current de facto programming environment for developing data-centric parallel applications for parsing and processing large datasets. MapReduce is inspired by the Map and Reduce primitives used in functional programming. In MapReduce programming, users only have to write the logic of the Mapper and Reducer, while the process of shuffling, partitioning, and sorting is automatically handled by the execution engine [14, 27, 28]. The data can either be saved in the Hadoop file system as unstructured data or in a database as structured data [14]. The Hadoop Distributed File System (HDFS) is responsible for breaking large data files into smaller pieces known as blocks. The blocks are placed on different data nodes, and it is the job of the NameNode to track which blocks on which data nodes make up the complete file. The NameNode also works as a traffic cop, handling all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. A pipeline is a link between multiple data nodes that exists to handle the transfer of data across the servers. A user application pushes a block to the first data node in the pipeline. The data node takes over and forwards the block to the next node in the pipeline; this continues until all the data, and all the data replicas, are saved to disk. Afterwards, the client repeats the process by writing the next block in the file [25].
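The division of labour described above, where the user supplies only the Mapper and Reducer and the engine handles shuffling and sorting, can be illustrated with a small single-process sketch. This is not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the execution engine, and the word-count job is chosen only because it is the customary example.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """User-written map logic: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-written reduce logic: sum all counts collected for one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    map_fn: Callable[[str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Hypothetical toy engine: map every record, shuffle by key, sort, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle step: group values by key
    # sort step: keys are handed to the reducer in sorted order
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


lines = ["big data in the cloud", "big data frameworks"]
word_counts = run_mapreduce(lines, mapper, reducer)
print(word_counts)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch keeps only the control flow.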
The two major components of Hadoop MapReduce are job scheduling and tracking. The early versions of Hadoop supported a limited job and task tracking system. In particular, the earlier scheduler could not manage non-MapReduce tasks and was not capable of optimizing cluster utilization. A new capability was therefore needed to address these shortcomings and to offer more flexibility, scaling, efficiency, and performance, which led to the introduction of Hadoop 2.0. Alongside the earlier HDFS, resource management, and MapReduce model, it introduced a new resource management layer called Yet Another Resource Negotiator (YARN) that takes care of better resource utilization [25].
YARN is the core Hadoop service that provides two major functionalities: global resource management (ResourceManager) and per-application management (ApplicationMaster). The ResourceManager is a master service which controls the NodeManager on each node of a Hadoop cluster. It includes a scheduler, whose main task is to allocate system resources to specific running applications. Resources are allocated in the form of containers, which bundle CPU, storage, network, and other important resource attributes necessary for executing applications in the cluster. The NodeManager is a per-node slave service of the ResourceManager that monitors application usage statistics on its node. Each deployed application is handled by a corresponding ApplicationMaster service. If more resources are required to support the running application, the ApplicationMaster negotiates with the ResourceManager (scheduler) for the additional capacity and then works with the NodeManagers to launch and monitor the granted containers [26].
2.2. Spark
Apache Spark [29], originally developed as Berkeley Spark, was proposed as an alternative to Hadoop. It can perform faster parallel computing operations by using in-memory primitives. A job can load data in either local memory or a cluster-wide shared memory and query it iteratively with much greater speed than disk-based systems such as Hadoop MapReduce [27]. Spark has been developed for two applications where keeping data in memory may significantly improve performance: iterative machine learning algorithms and interactive data mining. Spark is also intended to unify the current processing stack, where batch processing is performed using MapReduce, interactive queries are performed using HBase, and the processing of streams for real-time analytics is performed using other frameworks such as Twitter’s Storm. Spark offers programmers a functional programming paradigm with data-centric programming interfaces built on top of a new data model called Resilient Distributed Dataset (RDD), which is a collection of objects spread across a cluster and stored in memory or on disk [28]. Applications in Spark can load these RDDs into the memory of a cluster of nodes and let the Spark engine automatically manage the partitioning of the data and its locality during runtime. This versatile iterative model makes it possible to control the persistence and manage the partitioning of data. A stream of incoming data can be partitioned into a series of batches and processed as a sequence of small-batch jobs. The Spark framework allows this seamless combination of streaming and batch processing in a unified system. To provide rapid application development, Spark provides clean, concise APIs in Scala, Java, and Python. Spark can be used interactively from the Scala and Python shells to rapidly query big datasets.
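The idea of treating a stream as a sequence of small-batch jobs can be sketched in plain Python. This is only an illustration and does not use the Spark API: `micro_batches` and `process_batch` are hypothetical names, and batches are cut by record count for simplicity, whereas Spark's streaming layer cuts them by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) stream into consecutive fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # stream exhausted
            return
        yield batch


def process_batch(batch: List[int]) -> int:
    """Hypothetical batch job: here, simply the sum of the readings."""
    return sum(batch)


readings = range(1, 11)  # stand-in for an incoming stream of values 1..10
results = [process_batch(batch) for batch in micro_batches(readings, batch_size=4)]
print(results)
```

Because each batch is handled by an ordinary batch job, the same `process_batch` logic could also be applied to a stored dataset, which is the sense in which streaming and batch processing are unified.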
Spark is also the engine behind Shark, a complete Apache Hive-compatible data warehousing system that can run much faster than Hive. Spark also supports data access from Hadoop. Spark fits in seamlessly with the Hadoop 2.0 ecosystem (Figure 2) as an alternative to MapReduce, while using the same underlying infrastructure such as YARN and HDFS. Spark is also an integral part of the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka), which supports popular cloud-native PaaS use cases such as IoT, predictive analytics, and real-time personalization for big data. In SMACK, the Apache Mesos cluster manager (instead of YARN) is used for dynamic allocation of cluster resources, not only for running Hadoop applications but also for handling heterogeneous workloads.
Figure 2: Hadoop 2.0 ecosystem, source: [14].
The GraphX and MLlib libraries include state-of-the-art graph and machine learning algorithms that can be executed in real time. BlinkDB is a novel parallel, sampling-based approximate query engine for running interactive SQL queries that trade off query accuracy for response time, with results annotated by meaningful error bars. BlinkDB has been proven to run 200 times faster than Hive within an error rate of 2–10%. Moreover, Spark provides an interactive tool called Spark Shell which allows exploiting the Spark cluster in real time. Once interactive applications are created, they may subsequently be executed interactively in the cluster. In Figure 3, we present the general Spark system architecture.
Figure 3: Spark architecture, source: [15].
2.3. Flink
Apache Flink is an emerging competitor of Spark which offers functional programming interfaces very similar to those of Spark. It shares many of Spark's programming primitives and transformations for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch stream processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications which apply a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) to the input datasets. The Flink dataset uses a table-based model; therefore, application developers can use index numbers to specify a particular field of a dataset [27, 28].
Flink is able to achieve high throughput and low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For distributed execution, Flink chains operator subtasks together into tasks. Each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are on standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. Figure 4 demonstrates the main components of the Flink framework.
Figure 4: Architectural components of Flink, source: [16].
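Operator chaining, as mentioned in the description of Flink, can be pictured as fusing several transformations into one task so that each record passes through all of them without being handed from one thread to another. The following plain-Python sketch illustrates only that idea; it is not the Flink API, and `map_op`, `filter_op`, and `chain` are hypothetical helper names.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator, List

# An operator turns one stream of records into another stream of records.
Operator = Callable[[Iterable[int]], Iterator[int]]


def map_op(fn: Callable[[int], int]) -> Operator:
    """Build a map operator that applies fn to every record."""
    return lambda records: (fn(record) for record in records)


def filter_op(predicate: Callable[[int], bool]) -> Operator:
    """Build a filter operator that keeps records satisfying predicate."""
    return lambda records: (record for record in records if predicate(record))


def chain(operators: List[Operator]) -> Operator:
    """Fuse operators into one task; records flow lazily through all of them."""
    def task(records: Iterable[int]) -> Iterator[int]:
        return reduce(lambda data, op: op(data), operators, records)
    return task


task = chain([map_op(lambda x: x * 2), filter_op(lambda x: x % 3 == 0)])
output = list(task(range(1, 8)))
print(output)
```

Since the operators are lazy generators, each record is doubled and then filtered before the next record is read, which mirrors the record-at-a-time flow inside a chained task.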
2.4. Storm
Storm [17] is a free open-source distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any kind of programming language for developing applications such as real-time streaming analytics, critical work flow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. Apache Storm cluster is made up of two types of processing actors: spouts and bolts.
Spout is connected to the external data source of a stream and is continuously emitting or collecting new data for further processing.
Bolt is a processing logic unit within a streaming processing topology; each bolt is responsible for a certain processing task such as transformation, filtering, aggregating, and partitioning.
Storm defines workflows as directed acyclic graphs (DAGs), called topologies, with connected spouts and bolts as vertices. Edges in the graph define the links between the bolts and the data streams. Unlike batch jobs, which are executed only once, Storm jobs run until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates tasks to the worker processes, and each worker process then runs a subset of the topology. Apache Storm relies on an internal messaging layer based on Netty for communication between worker processes, while Zookeeper coordinates the communication between the real-time job tracker (nimbus) and the supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
Figure 5: Architecture of Storm cluster, source: [17].
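The spout and bolt roles can be sketched with Python generators wired into a small linear topology. This is an illustration only and does not use the Storm API: the function names are hypothetical, and the spout is finite so that the example terminates, whereas a real Storm topology runs until it is killed.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout() -> Iterator[str]:
    """Spout: connects to a data source and emits tuples (a fixed list here)."""
    yield from ["storm processes streams", "streams of tuples", "storm topologies"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: transformation step that splits each sentence into words."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: aggregation step that emits the running count for each word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Topology: spout -> split bolt -> count bolt (the edges of the DAG).
topology = count_bolt(split_bolt(sentence_spout()))
final_counts = dict(topology)  # the last running count per word wins
print(final_counts)
```

Each stage consumes tuples as soon as the previous stage emits them, which is the tuple-at-a-time behaviour that distinguishes this model from batch processing.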
2.5. Apache Samza
Apache Samza [18] is a distributed stream processing framework, mainly written in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared to Storm [8]. It uses Apache Kafka, which was originally developed at LinkedIn, for messaging and streaming, while Apache Hadoop YARN/Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka’s semantics to define the way streams are handled. Kafka's main objective is to collect and deliver massively large volumes of event data, in particular log data, with low latency. A Kafka system’s architecture is comparatively simple, as it only consists of a set of brokers, which are the individual nodes that make up a Kafka cluster. Data streams are defined by topics, each of which is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances, and consumers retrieve the corresponding messages using a pull mechanism. The basic flow of job execution is presented in Figure 6.
Figure 6: Samza architecture, source: [18].
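The topic, partition, and pull concepts can be sketched as an in-memory model. This is not the Kafka client API: the `Topic` and `Consumer` classes are hypothetical, there is a single process instead of a set of brokers, and a CRC32 hash is used as a stand-in for the partitioner purely for illustration.

```python
import zlib
from typing import List


class Topic:
    """A named stream split into append-only partitions."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, message: str) -> int:
        """Append a message to the partition chosen by hashing the key."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(message)
        return index


class Consumer:
    """Pull-based reader that tracks its own offset in each partition."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition: int, max_messages: int = 10) -> List[str]:
        """Fetch messages after the current offset, then advance the offset."""
        start = self.offsets[partition]
        batch = self.topic.partitions[partition][start:start + max_messages]
        self.offsets[partition] += len(batch)
        return batch


topic = Topic("page-views", num_partitions=3)
first_partition = topic.publish("user-1", "view:/home")
second_partition = topic.publish("user-1", "view:/cart")
consumer = Consumer(topic)
first_poll = consumer.poll(first_partition)
second_poll = consumer.poll(first_partition)
print(first_poll, second_poll)
```

Messages with the same key land in the same partition and are read back in order, and because the consumer pulls at its own pace, a second poll with no new messages simply returns an empty list.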
Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, MapReduce computation data flow follows chain of stages with no loop. At each stage, the program proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flow, Spark, Storm, and Samza represent it as directed acyclic graph to optimize the execution plan. Flink supports controlled cyclic dependency graph in runtime to represent the machine learning algorithms in a very efficient way. Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features. Flink provides a Scala shell to configure standalone as well as cluster setup. Apache Hadoop is highly scalable and it has been used in the Yahoo production consisting of 42000 nodes in 20 YARN clusters. The largest known cluster size for Spark is of 8000 computing nodes while Storm has been tested on a maximum of 300 node clusters. Apache Samza cluster, with around a hundred nodes, has been used in LinkedIn data flow and application messaging system. Apache Flink has been customized for Alibaba search engine with a deployment capacity of thousands of processing nodes.
Table 1: Comparison of big data frameworks.

| Attribute | Hadoop | Spark | Storm | Samza | Flink |
| --- | --- | --- | --- | --- | --- |
| Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2 |
| Batch processing | Yes | Yes | Yes | No | Yes |
| Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch |
| Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning |
| Resource management | YARN | YARN/Mesos | HDFS (YARN)/Mesos | YARN/Mesos | Zookeeper/YARN/Mesos |
| Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R |
| Job management/optimization | MapReduce approach | Catalyst extension | Storm-YARN/3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer |
| Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell |
| Machine learning libraries | Apache Mahout/H2O | Spark ML and MLlib | Trident-ML/Apache SAMOA | Apache SAMOA | Flink-ML |
| Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes |
Table 2: Comparative analysis of big data resource frameworks (s=5).

| Attribute | Hadoop | Spark | Flink | Storm | Samza |
| --- | --- | --- | --- | --- | --- |
| Processing speed | ★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★ |
| Fault tolerance | ★★★★ | ★★ | ★★★★ | ★★★ | ★★★★ |
| Scalability | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★ |
| Machine learning | ★★ | ★★★★★ | ★★★★ | ★★★ | ★★★★ |
| Low latency | ★★ | ★★★ | ★★★ | ★★★★ | ★★★★ |
| Security | ★★★★ | ★★★★★ | ★★★★ | ★★★★ | ✘ |
| Dataset size | ★★★★★ | ★★★ | ★★★★ | ★★★ | ★★★ |
3. Comparative Evaluation of Big Data Frameworks
Big data in cloud computing, a popular research trend, is posing significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services are offering all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints that may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on some empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.
3.1. Processing Speed
Processing speed is an important performance measure that may be used to evaluate the effectiveness of different resource management frameworks. It is commonly expressed as the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed $\bar{m}$, calculated after $n$ iterations, is the maximum number of memory/disk-intensive operations $m_i$ that can be performed over a time interval $t_i$:
$$\bar{m} = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. \tag{1}$$
Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop on nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the necessary statistics related to execution time, working nodes, and dataset size were recorded. Spark performed better than Hadoop in most cases. Furthermore, Spark on the YARN platform showed suboptimal results compared to the case when it was executed in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset. Vellaipandiyan and Raja [33] demonstrated a performance evaluation and comparison of the Hadoop and Spark frameworks on a residents' record dataset ranging from 100 GB to 900 GB in size. Spark performed relatively better when the dataset was of small to medium size (100 GB–750 GB); afterwards, its performance declined as compared to Hadoop.
The primary reason for the performance decline was that the Spark cache could not fit into memory for the larger datasets. Taran et al. [34] quantified the performance differences between Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. Hadoop was five times faster than Spark when the evaluation was performed on a larger set of data sources, whereas Spark performed better on the smaller tasks. However, the speed-up ratio decreased for both frameworks as the input dataset grew.
Gopalani and Arora [35] used the K-Means algorithm on small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The results showed that Spark performed up to three times better than MapReduce in most cases. Bertoni et al. [36] experimentally evaluated Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when a genomic join application was deployed.
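The average processing speed of equation (1) can be illustrated with a short Python sketch. The function name and the sample measurements below are hypothetical and are not taken from any of the cited studies; the sketch only shows that the metric is a ratio of totals over all iterations.

```python
from typing import Sequence


def average_processing_speed(operations: Sequence[float],
                             durations: Sequence[float]) -> float:
    """Average processing speed as in Eq. (1): sum of m_i over sum of t_i."""
    if len(operations) != len(durations) or len(operations) == 0:
        raise ValueError("need one duration per iteration and at least one iteration")
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total measured time must be positive")
    return sum(operations) / total_time


# Hypothetical measurements: operations completed (m_i) and seconds taken (t_i)
# in three benchmark iterations.
ops_per_iteration = [1200.0, 1500.0, 900.0]
seconds_per_iteration = [2.0, 3.0, 1.0]

print(average_processing_speed(ops_per_iteration, seconds_per_iteration))  # 3600 / 6 = 600.0
```

Because the metric divides total operations by total time, longer iterations carry more weight than they would in a plain average of the per-iteration rates (600, 500, and 900 operations per second in this example).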
3.2. Fault Tolerance
Fault tolerance is the characteristic that enables a system to continue functioning when one or more of its components fail. High-performance computing applications involve hundreds of interconnected nodes that perform a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system to meet its requirements after a disruption, represented as $\mathrm{Tol}_{FT}$, is the ratio of the time to complete tasks without observing any fault events to the overall execution time when some fault events were detected and the system state was reverted to a consistent state:
$$\mathrm{Tol}_{FT} = \frac{T_x}{T_x + \sigma^2} \tag{2}$$
where $T_x$ is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time of several application runs that produce a known correct output, and $\sigma^2$ represents the variance in a program’s execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$, where the tolerance of each individual task is computed as $\mathrm{Tol}_{\tau_i}$, the overall application resilience, $\mathrm{Tol}$, may be calculated as [37]
$$\mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdot \ldots \cdot \mathrm{Tol}_{\tau_n}} \tag{3}$$
Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance of Apache Spark, Storm, and Samza. They found that, with increasing data size, Spark is much more stable and fault-tolerant than Storm but may be less efficient than Samza. Furthermore, in terms of handling large volumes of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. For smaller datasets such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a wide margin. However, this speed-up degraded as the dataset grew, and for large datasets, Hadoop easily outperformed Spark.
Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system in which Apache Storm demonstrated better throughput than Flink. For fault tolerance, different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect failed tasks and migrate them to other machines, which resulted in very few message losses. Storm, on the other hand, took more time because Zookeeper, which involves some performance overhead, was responsible for reporting the state of Nimbus before the failed task could be processed on other nodes.
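Equations (2) and (3) can be sketched in a few lines of Python. The function names and the sample values are hypothetical; how $T_x$ and $\sigma^2$ are measured is left to the experimenter, so the sketch simply takes them as inputs.

```python
import math
from typing import Sequence


def task_tolerance(t_x: float, sigma2: float) -> float:
    """Eq. (2): fault-free execution time over itself plus the fault-induced variance."""
    if t_x <= 0 or sigma2 < 0:
        raise ValueError("t_x must be positive and sigma2 non-negative")
    return t_x / (t_x + sigma2)


def application_tolerance(task_tolerances: Sequence[float]) -> float:
    """Eq. (3): geometric mean (n-th root of the product) of the per-task tolerances."""
    n = len(task_tolerances)
    if n == 0:
        raise ValueError("at least one task tolerance is required")
    return math.prod(task_tolerances) ** (1.0 / n)


# Hypothetical (T_x, sigma^2) pairs for three tasks of one application.
tolerances = [
    task_tolerance(100.0, 0.0),   # no variance due to faults
    task_tolerance(100.0, 25.0),
    task_tolerance(50.0, 50.0),
]
print(tolerances)                         # [1.0, 0.8, 0.5]
print(application_tolerance(tolerances))  # cube root of 0.4
```

Since (3) is a geometric mean, a single task with very low tolerance pulls the application-level value down more strongly than it would in an arithmetic mean.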
3.3. Scalability
Scalability refers to the ability to accommodate larger loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (by making hardware stronger) or scale-out (by adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time, complexity, and effort. The definition of scalability comes from Amdahl’s and Gustafson’s laws [41]. Let $W$ be the size of the workload before the improvement of the system resources, $\alpha$ the fraction of the workload that does not benefit from the improvement, and $1-\alpha$ the fraction that does. When using an $n$-processor system, the user workload is scaled to
$$W' = \alpha W + (1-\alpha)\,n\,W \tag{4}$$
The scaled-workload speed-up $S'$ obtained by executing the scaled workload $W'$ on $n$ processors is defined as
$$S' = \frac{W'}{W} = \frac{\alpha W + (1-\alpha)\,n\,W}{W} \tag{5}$$
García-Gil et al. [42] compared the scalability and performance of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability of the frameworks. Spark was observed to be 4–10 times faster than Flink. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, by benchmarking the Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] investigated scalability with respect to both data size and dimensionality in order to provide a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular K-Means clustering algorithm to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory became available through the addition of new computing nodes.
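The scaled workload and scaled-workload speed-up of equations (4) and (5) can be sketched as follows. The sketch treats α as the coefficient of the term that is not multiplied by n, which is how it enters (4) and (5); the function names and the sample numbers are hypothetical.

```python
def scaled_workload(w: float, alpha: float, n: int) -> float:
    """Eq. (4): W' = alpha * W + (1 - alpha) * n * W."""
    if w <= 0:
        raise ValueError("workload must be positive")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return alpha * w + (1.0 - alpha) * n * w


def scaled_speedup(w: float, alpha: float, n: int) -> float:
    """Eq. (5): S' = W' / W, which simplifies to alpha + (1 - alpha) * n."""
    return scaled_workload(w, alpha, n) / w


# Hypothetical workload of 1000 units with alpha = 0.1.
for processors in (1, 4, 16, 64):
    print(processors, scaled_speedup(1000.0, 0.1, processors))
```

The speed-up grows linearly with n and its slope is 1 − α, so the smaller the unscaled fraction, the closer the system comes to ideal linear scaling.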
3.4. Machine Learning and Iterative Tasks Support
Big data applications are inherently complex and usually involve iterative tasks and algorithms. These applications have a distinctly cyclic nature: they reach the desired result by repeating a set of tasks until the result cannot be substantially improved further.
Spangenberg et al. [45] used real-world datasets and four algorithms, that is, WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. Apache Storm was observed to perform better than Flink in batch mode. However, with increasing complexity, Apache Flink had a performance advantage over Storm and was thus better suited for iterative data or graph processing. Shi et al. [46] analyzed Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark was 1.5 times faster than MapReduce in its first iteration and more than 5 times faster in subsequent iterations.
Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Because static data is used in every iteration of MapReduce, it is processed more frequently than dynamic data, which may cause significant performance overhead in MapReduce. Apache Spark, on the other hand, uses a read-only cached version of objects (the resilient distributed dataset) that can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads using four iterative algorithms. The experiments were performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.
3.5. Latency
Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time-critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of a high-speed network to reduce communication latency. Furthermore, big data frameworks usually follow a centralized design in which the scheduler assigns all tasks through a single node, which may significantly increase latency when the data is huge.
Let $T_{elapsed}$ be the elapsed time between the start and finish of a program in a distributed architecture, $T_i$ the effective execution time, and $\lambda_i$ the sum of the total idle units of the $i$th processor from a set of $N$ processors. Then, the average latency $\lambda(W, N)$ for a workload of size $W$ is defined as the average amount of overhead time needed for each processor to complete the task:
$$\lambda(W, N) = \frac{\sum_{i=1}^{N} \left(T_{elapsed} - T_i + \lambda_i\right)}{N} \tag{6}$$
Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The results indicated that, at high throughput, Flink and Storm have significantly lower latency than Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance of streaming frameworks. Storm’s latency was in most cases far lower than Spark’s, except when the scale of the workload and dataset was massive.
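The average latency of equation (6) can be computed as in the following Python sketch. The function name and the timing values are hypothetical; the sketch assumes that the execution time and idle time of every processor are available in the same time unit as the elapsed time.

```python
from typing import Sequence


def average_latency(t_elapsed: float,
                    exec_times: Sequence[float],
                    idle_times: Sequence[float]) -> float:
    """Eq. (6): mean over N processors of (T_elapsed - T_i + lambda_i)."""
    n = len(exec_times)
    if n == 0 or n != len(idle_times):
        raise ValueError("need one execution time and one idle time per processor")
    overhead = sum(t_elapsed - t_i + lam_i
                   for t_i, lam_i in zip(exec_times, idle_times))
    return overhead / n


# Hypothetical run on four processors, all values in seconds.
elapsed = 10.0
execution = [8.0, 9.0, 7.0, 6.0]   # T_i
idle = [1.0, 0.5, 2.0, 3.0]        # lambda_i

print(average_latency(elapsed, execution, idle))  # (3 + 1.5 + 5 + 7) / 4 = 4.125
```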
3.6. Security
Unlike the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded with ease by different data users. Although data privacy and security are not new topics in distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. The dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.
Let $N$ be the number of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and an access control list for authentication and authorization services. Let $\max(W_i)$ be the maximum weight that is assigned to the $i$th security category from the list of $N$ categories and $W_i$ be the reputation score of a particular resource management framework in that category. Then, the framework security ranking score can be represented as
$$\mathrm{Sec}_{score} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)} \tag{7}$$
Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control authentication and authorization. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.
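The security ranking score of equation (7) can be sketched as below. The category names, weights, and scores are hypothetical placeholders chosen only to show the arithmetic; they are not scores for any of the frameworks discussed here.

```python
from typing import Mapping, Tuple


def security_score(categories: Mapping[str, Tuple[float, float]]) -> float:
    """Eq. (7): sum of W_i divided by sum of max(W_i).

    `categories` maps a category name to (W_i, max(W_i)).
    """
    if not categories:
        raise ValueError("at least one security category is required")
    for name, (score, max_weight) in categories.items():
        if not 0.0 <= score <= max_weight:
            raise ValueError(f"score for {name!r} must lie in [0, max weight]")
    total_max = sum(max_weight for _, max_weight in categories.values())
    if total_max <= 0:
        raise ValueError("the maximum weights must sum to a positive value")
    return sum(score for score, _ in categories.values()) / total_max


# Hypothetical framework: (score, maximum weight) per category.
example = {
    "authentication": (2.0, 3.0),
    "authorization": (1.0, 2.0),
    "encryption": (0.0, 3.0),
}
print(security_score(example))  # 3 / 8 = 0.375
```

The score is normalized to the interval [0, 1], so frameworks evaluated against the same category list can be compared directly.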
3.7. Dataset Size Support
Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, performance may degrade as the data grows. Furthermore, in many cases, this may cause the software to crash, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in earlier sections.
4. Discussion on Big Data Framework
Every big data framework has evolved around its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of a star ranking, $\mathrm{Ranking}_{RF} \in [0, s]$, where $s$ represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as
$$\mathrm{Ranking}_{RF} = \frac{s}{\sum_{i=1}^{7}\sum_{j=1}^{N} \max(W_{ij})} \cdot \sum_{i=1}^{7}\sum_{j=1}^{N} W_{ij} \tag{8}$$
where $N$ is the total number of research studies for a particular set of evaluation metrics, $\max(W_{ij}) \in [0, 1]$ is the relative normalized weight assigned to each literature study based on the number of experiments performed, and $W_{ij}$ is the framework test-bed score calculated from the experimental results of each study.
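The ranking of equation (8) can be sketched in Python as follows. The factor names, study weights, scores, and the choice of s = 5 are all hypothetical. The sketch also allows a different number of studies per factor, which is a small generalization of the fixed N used in (8).

```python
from typing import Mapping, Sequence, Tuple

# One study contributes a pair (W_ij, max(W_ij)).
Study = Tuple[float, float]


def star_ranking(factors: Mapping[str, Sequence[Study]], s: float = 5.0) -> float:
    """Eq. (8): s * (sum of all W_ij) / (sum of all max(W_ij))."""
    total_score = sum(w for studies in factors.values() for w, _ in studies)
    total_max = sum(m for studies in factors.values() for _, m in studies)
    if total_max <= 0:
        raise ValueError("the study weights must sum to a positive value")
    return s * total_score / total_max


# Hypothetical evidence for one framework over two of the seven factors.
evidence = {
    "processing_speed": [(0.8, 1.0), (0.3, 0.5)],
    "latency": [(0.5, 1.0)],
}
print(star_ranking(evidence))  # 5 * 1.6 / 2.5, approximately 3.2
```

Because every score is bounded by its study weight, the result always falls between 0 and s.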
Hadoop MapReduce has a clear edge in large-scale deployment and the processing of larger datasets. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that provides failure-free operation over a long period of time. Hadoop can operate on a low-cost configuration. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key application requirements.
Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem and to run over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, aim to improve the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, adds overhead to cluster configuration.
Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture, and it can work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but when memory runs short, it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark by providing better support for iterative processing with high throughput at the cost of low latency.
Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm is inefficient in meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time applications where the workload must be processed with minimal delay under strict latency requirements.
Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can achieve high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.
To categorize the selection of the best resource engine for a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided in the literature, the categorization of major resource management frameworks is presented in Figure 7.
Figure 7: Categorization of resource engines based on key big data application requirements.
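As an illustration of how the contribution labels (++, +, −, −−) could be used to shortlist an engine, the following Python sketch converts the labels to numbers and sums them over the required goals. The numeric encoding, the engine names, and the labels are assumptions made for this example only; they do not reproduce Figure 7 or the procedure of Chung et al. [51].

```python
from typing import Dict, List, Mapping, Sequence

# Assumed numeric encoding of the contribution labels (ASCII "-" stands for the minus sign).
CONTRIBUTION_VALUE: Dict[str, int] = {"++": 2, "+": 1, "-": -1, "--": -2}


def rank_engines(contributions: Mapping[str, Mapping[str, str]],
                 required_goals: Sequence[str]) -> List[str]:
    """Order engines by the summed contribution values over the required goals."""
    totals = {
        engine: sum(CONTRIBUTION_VALUE.get(labels.get(goal, ""), 0)
                    for goal in required_goals)
        for engine, labels in contributions.items()
    }
    # Highest total first; ties keep their original order.
    return sorted(totals, key=lambda engine: totals[engine], reverse=True)


# Hypothetical engines and labels, not taken from Figure 7.
contributions = {
    "engine_a": {"security": "+", "latency": "--", "scalability": "++"},
    "engine_b": {"security": "-", "latency": "++", "scalability": "+"},
}
print(rank_engines(contributions, ["latency", "scalability"]))  # ['engine_b', 'engine_a']
```

A goal that has no label for an engine contributes zero in this sketch; a stricter variant could reject engines with any very negative contribution to a required goal.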
5. Related Work
Our research work differs from other efforts in both its goal and its object of study, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion focused on some basic differences between real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPU). Our case is different, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility, whereas our primary focus is the resource management frameworks themselves.
Chen and Zhang [11] discussed big data problems, challenges, and associated techniques and technologies to address these issues. Several potential techniques including cloud computing, quantum computing, granular computing, and biological computing were investigated and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assunção et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.
As shown in Table 3, our work differs from the previous studies which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks based on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.
Table 3: Comparison and application areas of related research studies.
| Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology |
| --- | --- | --- | --- | --- |
| [8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | ✘ |
| [9] | Batch and stream processing systems | Horizontal scaling systems, such as peer-to-peer, MapReduce/MPI, and Spark, and vertical scaling systems, such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks |
| [10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits |
| [11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | ✘ |
| [12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez as well as SQL, Graph, and bulk synchronous parallel model | Analysis of current open research challenges in the field of big data and the promising directions for future research | ✘ |
| [13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity/scaling metrics for stream processing systems |
6. Conclusions and Future Work
A number of disruptive and transformative big data technologies and solutions are rapidly emerging and evolving to provide data-driven insight and innovation. The primary objective of this study was to classify popular big data resource management systems. The study also aimed to address the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each. We evaluated the performance of resource management engines on seven key factors, and each framework was ranked based on the empirical evidence from the literature.
6.1. Observations and Findings
Some key findings of the study are as follows:
In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2 cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom, smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most performance evaluation studies employed standard benchmarking test sets in which the dataset size was relatively large, and hence this use case was not reported. Further research effort is required to elucidate the specific factors underlying this case.
Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. In our own experiments, Apache Spark version 1.6 (selected for compatibility with earlier studies) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with increasing data scale in several studies [38, 52], it has limitations in handling the larger datasets typical of big data applications, and hence the findings of such studies cannot be generalized.
Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for building distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this comparison needs further investigation, as studies such as [53] reported the opposite, with Flink outperforming Spark on sufficiently large cases.
For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse than that of Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies such as [55] reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.
6.2. Future Work
In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards a better understanding of the big data world and towards improving the state of the art in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is also required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines against the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be carried out; in particular, the performance impact of dynamically adding worker nodes is of particular interest. Additionally, further research can evaluate performance with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements.
The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction of future research.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] Big Data in the Cloud: Converging Technologies, (August 11), https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf
[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang. Lifetime service level agreement management with autonomous agents for services provision. 2009; 179(15): 2591–2605. doi:10.1016/j.ins.2009.01.037
[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier. Monitoring and reputation mechanisms for service level agreements. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface. 2008; 5206: 125–139. doi:10.1007/978-3-540-85485-2_10
[4] N. Yuhanna and M. Gualtieri. The Forrester Wave™: Big Data Hadoop Cloud Solutions, Q2 2016 Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, (June 20), https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave__Big_D.pdf
[5] R. Arora. An introduction to big data, high performance computing, high-throughput computing, and Hadoop. 2016; 1–12. doi:10.1007/978-3-319-33742-5_1
[6] F. Pop, J. Kołodziej, and B. D. Martino. Springer, 2016.
[7] M. Gribaudo, M. Iacono, and F. Palmieri. Performance modeling of big data-oriented architectures. Computer Communications and Networks. Cham: Springer International Publishing, 2016; 3–34. doi:10.1007/978-3-319-44881-7_1
[8] G. Hesse and M. Lorenz. Conceptual survey on data stream processing systems. In: Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), Melbourne, Australia, December 2015; 797–802. doi:10.1109/ICPADS.2015.106
[9] D. Singh and C. K. Reddy. A survey on platforms for big data analytics. 2014; 2: 8. doi:10.1186/s40537-014-0008-6
[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. 2015; 2(1): 24. doi:10.1186/s40537-015-0032-1
[11] C. L. P. Chen and C. Y. Zhang. Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. 2014; 275: 314–347. doi:10.1016/j.ins.2014.01.015
[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr. Big data 2.0 processing systems: taxonomy and open challenges. 2016; 14(3): 379–405. doi:10.1007/s10723-016-9371-1
[13] M. D. d. Assunção, A. d. S. Veith, and R. Buyya. Distributed data stream processing and edge computing: a survey on resource elasticity and future directions. 2018; 103: 1–17. doi:10.1016/j.jnca.2017.12.001
[14] Big data working group big data taxonomy, 2014.
[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo. An experimental survey on big data frameworks. 2018. doi:10.1016/j.future.2018.04.032
[16] H.-K. Kwon and K.-K. Seo. A fuzzy AHP based multi-criteria decision-making model to select a cloud service. 2014; 8(3): 175–180. doi:10.14257/ijsh.2014.8.3.16
[17] A. Rezgui and S. Rezgui. A stochastic approach for virtual machine placement in volunteer cloud federations. In: Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, Boston, MA, USA, March 2014; 277–282. doi:10.1109/IC2E.2014.85
[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang. A novel QoS-based framework for cloud computing service provider selection. 2014; 4(2): 48–72. doi:10.4018/ijcac.2014040104
[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens. Hybrid computing-where hpc meets grid and cloud computing. 2011; 27(5): 440–453. doi:10.1016/j.future.2010.11.003
[20] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the grid: enabling scalable virtual organizations. 2001; 15(3): 200–222. doi:10.1177/109434200101500302
[21] S. Benedict. Performance issues and performance analysis tools for HPC cloud applications: a survey. 2013; 95(2): 89–108. doi:10.1007/s00607-012-0213-0
[22] E. C. Inacio and M. A. R. Dantas. A survey into performance and energy efficiency in HPC, cloud and big data environments. 2014; 14(4): 299–318. doi:10.1504/IJNVO.2014.067878
[23] N. Yuhanna and M. Gualtieri. Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption Of Hadoop in the Cloud, Q2, 2016.
[24] S. Ullah, M. D. Awan, and M. S. Khiyal. A price-performance analysis of EC2, google compute and rackspace cloud providers for scientific computing. 2016; 16(02): 178–192. doi:10.22436/jmcs.016.02.06
[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman. John Wiley & Sons, 2013.
[26] S. K. Garg, S. Versteeg, and R. Buyya. A framework for ranking of cloud computing services. 2013; 29(4): 1012–1023. doi:10.1016/j.future.2012.06.006
[27] D. Wu, S. Sakr, and L. Zhu. Big data programming models. 2017; 31–63. doi:10.1007/978-3-319-49340-4_2
[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera. Big data preprocessing: methods and prospects. 2016; 1: 1. doi:10.1186/s41044-016-0014-0
[29] A. Li, X. Yang, S. Kandula, and M. Zhang. CloudCmp: comparing public cloud providers. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), Melbourne, Australia, November 2010; ACM; 1–14. doi:10.1145/1879141.1879143
[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño. Performance evaluation of big data frameworks for large-scale data analytics. In: Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, December 2016; 424–431. doi:10.1109/BigData.2016.7840633
[31] I. Mavridis and H. Karatza. Log file analysis in cloud with Apache Hadoop and Apache Spark. In: Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.
[32] M. Zaharia, M. Chowdhury, and T. Das. Fast and interactive analytics over Hadoop data with Spark. 2012; 37(4): 45–51.
[33] S. Vellaipandiyan and P. V. Raja. Performance evaluation of distributed framework over YARN cluster manager. In: Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016. doi:10.1109/ICCIC.2016.7919520
[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi. Performance evaluation of distributed computing environments with Hadoop and Spark frameworks. In: Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), Lviv, Ukraine, October 2017; 80–83. doi:10.1109/YSF.2017.8126655
[35] S. Gopalani and R. Arora. Comparing Apache Spark and Map Reduce with performance analysis using K-means. 2015; 113(1): 8–11. doi:10.5120/19788-0531
[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli. Evaluating cloud frameworks on genomic applications. In: Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, November 2015; 193–202. doi:10.1109/BigData.2015.7363756
[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann. Towards new metrics for high-performance computing resilience. In: Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, June 2017; 23–30. doi:10.1145/3086157.3086163
[38] R. Lu, G. Wu, B. Xie, and J. Hu. Stream bench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, December 2014; 69–78. doi:10.1109/UCC.2014.15
[39] L. Gu and H. Li. Memory or time: performance evaluation for iterative operation on Hadoop and Spark. In: Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013 and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, November 2013; 721–727. doi:10.1109/HPCC.and.EUC.2013.106
[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte. A performance comparison of open-source stream processing platforms. In: Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016. doi:10.1109/GLOCOM.2016.7841533
[41] J. L. Gustafson. Reevaluating Amdahl's law. 1988; 31(5): 532–533. doi:10.1145/42411.42415
[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera. A comparison on scalability for batch big data processing on Apache Spark and Apache Flink. 2017; 2: 1. doi:10.1186/s41044-016-0020-2
[43] P. Jakovits and S. N. Srirama. Evaluating Mapreduce frameworks for iterative scientific computing applications. In: Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, July 2014; 226–233. doi:10.1109/HPCSim.2014.6903690
[44] C. Boden, A. Spina, T. Rabl, and V. Markl. Benchmarking data flow systems for scalable machine learning. In: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017. doi:10.1145/3070607.3070612
[45] N. Spangenberg, M. Roth, and B. Franczyk. Lecture Notes in Business Information Processing. Springer, Cham, 2015; 208. doi:10.1007/978-3-319-19027-3_3
[46] J. Shi, Y. Qiu, U. F. Minhas, L. Jiao, C. Wang, B. Reinwald, and F. Özcan. Clash of the titans: MapReduce vs. Spark for large scale data analytics. In: Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, HI, USA, 2015; 2110–2121. doi:10.14778/2831360.2831365
[47] M. Kang and J. Lee. A comparative analysis of iterative MapReduce systems. In: Proceedings of the Sixth International Conference, Jeju, Republic of Korea, October 2016; 61–64. doi:10.1145/3007818.3007819
[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon. An experimental comparison of iterative MapReduce frameworks. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, October 2016; 2089–2094. doi:10.1145/2983323.2983647
[49] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. Peng, 
J.PouloskyP.Benchmarking streaming computation engines: Storm, Flink and Spark streamingProceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016May 20161789179210.1109/IPDPSW.2016.1382-s2.0-84991585535ZhangX.LiuC.NepalS.DouW.ChenJ.Privacy-preserving layer over MapReduce on cloudProceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, Held Jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012November 2012chn3043102-s2.0-8487459507310.1109/CGC.2012.43ChungL.NixonB. A.YuE.Using nonfunctional requirements to systematically select among alternatives in architectural designProceedings of the 1st International Workshop on Architectures for Software Systems19953143QianS.WuG.HuangJ.DasT.Benchmarking modern distributed streaming platformsProceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT)March 2016Taipei, Taiwan59259810.1109/ICIT.2016.7474816DambrevilleF.ToumiA.Generic and massively concurrent computation of belief combination rulesProceedings of the the International Conference on Big Data and Advanced Wireless TechnologiesNovember 2016Blagoevgrad, Bulgaria1610.1145/3010089.3010136TechentinR. W.MarklandM. W.PooleR. J.HaiderD. R. H. C. R.GilbertB. K.Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer201663338MarcuO.-C.CostanA.AntoniuG.Pérez-HernándezM. S.Spark versus flink: understanding performance in big data analytics frameworksProceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016September 2016Taipei, Taiwan43344210.1109/CLUSTER.2016.222-s2.0-85013187492