An Efficient Mapreduce Scheduler for Cloud Environment

Jeyaraj, Rathinaraja.

Please use this identifier to cite or link to this item: https://idr.l4.nitk.ac.in/jspui/handle/123456789/16866

Title:	An Efficient Mapreduce Scheduler for Cloud Environment
Authors:	Jeyaraj, Rathinaraja.
Supervisors:	S, Ananthanarayana V.
Keywords:	Department of Information Technology;Bin Packing;Combiner;Heterogeneous Performance;Heterogeneous MapReduce Workloads;MapReduce Job Scheduler;MapReduce Task Placement
Issue Date:	2020
Publisher:	National Institute of Technology Karnataka, Surathkal
Abstract:	Hadoop MapReduce is one of the cost-effective ways to process a large volume of data for reliable and effective decision-making. Because an on-premise Hadoop cluster is not affordable for short-term users, many public cloud service providers such as Amazon, Google, and Microsoft offer Hadoop MapReduce and related applications as a service via a cluster of virtual machines over the Internet. In general, these Hadoop virtual machines are launched on different physical machines across the cloud data center and are co-located with non-Hadoop virtual machines. This introduces many challenges, most notably a layer of heterogeneities (hardware heterogeneity, virtual machine heterogeneity, performance heterogeneity, and workload heterogeneity) that impacts the performance of MapReduce job and task schedulers. The presence of physical servers with different configurations and performance in cloud data centers is called hardware heterogeneity. The existence of virtual machines of different sizes in a Hadoop virtual cluster is called virtual machine heterogeneity. Hardware heterogeneity, virtual machine heterogeneity, and interference from co-located non-Hadoop virtual machines together cause varying performance for the same map/reduce task of a job. This is called performance heterogeneity. Recent MapReduce versions allow users to customize the resource capacity (container size) for the map/reduce tasks of different jobs. This makes a batch of MapReduce jobs heterogeneous (workload heterogeneity). These heterogeneities are inevitable and profoundly affect the performance of MapReduce job and task schedulers with respect to job latency, makespan, and virtual resource utilization. Therefore, it is essential to exploit these heterogeneities while offering Hadoop MapReduce as a service in order to improve MapReduce scheduler performance in real time. Existing MapReduce job and task schedulers addressed some of these heterogeneities but fell short in improving performance.
To improve these qualities of service further, we proposed the following set of methods: Dynamic Ranking-based MapReduce Job Scheduler (DRMJS) to exploit performance heterogeneity, Multi-Level Per Node Combiner (MLPNC) to minimize the number of intermediate records in the shuffle phase, Roulette Wheel Scheme (RWS) based data block placement and a constrained 2-dimensional bin packing model to exploit virtual machine and workload level heterogeneities, and Fine-Grained Data Locality Aware (FGDLA) job scheduling, which extends MLPNC to a batch of jobs. Firstly, DRMJS is proposed to improve MapReduce job latency and resource utilization by exploiting heterogeneous performance. DRMJS calculates separate performance scores for each Hadoop virtual machine: one based on CPU and disk I/O for map tasks, and one based on CPU and network I/O for reduce tasks. Then, a rank list is prepared for scheduling map tasks based on the map performance score, and reduce tasks based on the reduce performance score. Ultimately, DRMJS improved overall job latency, makespan, and resource utilization up to 30%, 28%, and 60%, respectively, on average compared to existing MapReduce schedulers.
To improve job latency further, MLPNC is introduced to minimize the number of intermediate records in the shuffle phase, which is responsible for a significant portion of MapReduce job latency. In general, each map task runs a dedicated combiner function to minimize the number of intermediate records. In MLPNC, we split the combiner function from the map task and run a single MLPNC in every Hadoop virtual machine for a set of map tasks of the same job. These map tasks write their output to the common MLPNC, which minimizes the number of intermediate records level by level. Ultimately, MLPNC improved job latency up to 33% compared to existing MapReduce schedulers for a single job. However, in a production environment, a batch of MapReduce jobs is executed periodically. Therefore, to extend MLPNC to a batch of jobs, we introduced the FGDLA job scheduler. Results showed that FGDLA reduced the amount of intermediate data and makespan up to 62.1% and 32.4%, respectively, when compared to existing schedulers.
Secondly, virtual machine and workload level heterogeneities cause resource underutilization in the Hadoop virtual cluster and impact makespan for a batch of MapReduce jobs. Considering this, we proposed RWS based data block placement and a constrained 2-dimensional bin packing model to place heterogeneous map/reduce tasks onto heterogeneous virtual machines. RWS places data blocks based on the processing capacity of each virtual machine, and the bin packing model helps to find the right combination of map/reduce tasks of different jobs for each bin to improve makespan and resource utilization. The experimental results showed that the proposed model improved makespan and resource utilization up to 57.9% and 59.3%, respectively, over the MapReduce fair scheduler.
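As an illustration of the capacity-based placement idea described in the abstract, the following is a minimal sketch of generic roulette-wheel (fitness-proportionate) selection applied to data blocks. It is not taken from the thesis: the function name `place_blocks`, the virtual machine names, and the capacity values are hypothetical, and the actual RWS scheme may compute capacity and handle replication differently.

```python
import random
from collections import Counter


def place_blocks(num_blocks, vm_capacity, seed=None):
    """Assign each data block to a VM with probability proportional
    to that VM's (hypothetical) processing capacity."""
    rng = random.Random(seed)
    vms = list(vm_capacity)
    weights = [vm_capacity[vm] for vm in vms]
    # random.choices draws with replacement, weighting each VM by capacity;
    # this is the "spin the roulette wheel once per block" step.
    picks = rng.choices(vms, weights=weights, k=num_blocks)
    return Counter(picks)


# Hypothetical relative capacities: the large VM is four times the small one.
capacity = {"vm-small": 1.0, "vm-medium": 2.0, "vm-large": 4.0}
placement = place_blocks(700, capacity, seed=42)

for vm in capacity:
    print(vm, placement[vm])
```

With these assumed capacities, the larger virtual machines should receive proportionally more blocks on average, so that more map tasks can run data-local on the machines best able to process them.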
URI:	http://idr.nitk.ac.in/jspui/handle/123456789/16866
Appears in Collections:	1. Ph.D Theses

Files in This Item:

File	Description	Size	Format
155031 IT15F01.pdf		1.81 MB	Adobe PDF
