DEPENDABLE CLOUD RESOURCES FOR BIG-DATA BATCH PROCESSING & STREAMING FRAMEWORKS
thesisposted on 07.05.2021, 14:41 by Bara M Abusalah
The examiner of cloud computing systems in the last few years observes that there is a trend of the emergence of new Big Data frameworks every single year. Since Hadoop was developed in 2007, new frameworks followed it such as Spark, Storm, Heron, Apex, Flink, Samza, Kafka ... etc. Each framework is developed in a certain way to target and achieve certain objectives better than other frameworks do. However, there are few common functionalities and aspects that are shared between these frameworks. One vital aspect all these frameworks strive to achieve is better reliability and faster recovery time in case of failures. Despite all the advances in making datacenters dependable, failures actually still happen. This is particularly onerous for long-running “big data” applications, where partial failures can lead to significant losses and lengthy recomputations. This is also crucial for streaming systems where events are processed and monitored online in real time, and any delay in data delivery will cause a major inconvenience to the users.
Another observation is that some reliability implementations are redundant between different frameworks. Big data processing frameworks like Hadoop MapReduce include fault tolerance mechanisms, but these are commonly targeted at specific system/failure models, and are often redundant between frameworks. Encapsulating these implementations into one layer and making it shared between different applications will benefit more than one frame-work without the burden of re-implementing the same reliability approach in each single framework.
These observations motivated us to solve the problem by presenting two systems: Guardian and Warden. Guardian is tailored towards batch processing big data systems while Warden is targeted towards stream processing systems. Both systems are robust, RMS based, generic, multi-framework, flexible, customizable, low overhead systems that allow their users to run their applications with individually configurable fault tolerance granularity and degree, with only minor changes to their implementation.
Most reliability approaches carry out one rigid fault tolerance technique targeted towards one system at a time. It is more challenging to provide a reliability approach that is pluggable in multiple Big Data frameworks at a time and can achieve low overheads comparable with single targeted framework approaches, yet is flexible and customizable by its users to make it tailored towards their objectives. The genericity is attained by providing an interface that can be used in different applications from different frameworks in any part of the application code. The low overhead is achieved by providing faster application finish times with and without failures. The customizability is fulfilled by providing the users the options to choose between two fault tolerance guarantees (Crash Failures / Byzantine Failures) and, in case of streaming systems; it is combined with two delivery semantics (Exactly Once / At Most Once).
In other words, this thesis proposes the paradigm of dependable resources: big data processing frameworks are typically built on top of resource management systems (RMSs),and proposing fault tolerance support at the level of such an RMS yields generic fault tolerance mechanisms, which can be provided with low overhead by leveraging constraints on resources.
To the best of our knowledge, such approach was never tried on multiple big data batch processing and streaming frameworks before.
We demonstrate the benefits of Guardian by evaluating some batch processing frame-works such as Hadoop, Tez, Spark and Pig on a prototype of Guardian running on Amazon-EC2, improving completion time by around 68% in the presence of failures, while maintaining around 6% overhead. We’ve also built a prototype of Warden on the Flink and Samza (with Kafka) streaming frameworks. Our evaluations on Warden highlight the effectiveness of our approach in the presence of failures and without failures compared to other fault tolerance techniques (such as checkpointing)