MULTI-FACTOR DESIGNED EXPERIMENTS FOR COMPUTATIONAL PERFORMANCE MEASUREMENT ANALYSIS (CPM&A) OF PARALLEL DISTRIBUTED BIG DATA SYSTEMS AND SPAMHAUS DATA ANALYSIS
thesis posted on 2020-12-10, 20:42 authored by Yuying Song
Big data analysis poses great challenges in practice. Dataset sizes grow quickly, and analyzing big datasets in a timely manner is critical. In fact, computational performance depends very heavily not just on dataset size but also on the computational complexity of the analytic routines used in the analysis. Datasets that pose computational challenges span a very wide range of sizes. Furthermore, the hardware power available to the data analyst is an important factor. Better measurement and analysis can improve performance across wide ranges of dataset size, computational complexity, and hardware power. In the first part of the dissertation, I will develop an overall framework of practices that can guide performance measurement of big data platforms. It has two main impacts. The first is to provide a rigorous and comprehensive performance experiment framework for computing methods and systems for big data analytics by bringing statistical thinking, statistical experimental design, and statistical modeling to the research community. The second is to enable performance improvement by providing a much better understanding of performance.
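As an illustration of what a multi-factor designed experiment over these three factors might look like, the following is a minimal Python sketch of a replicated full factorial design with randomized run order. The factor names and levels are hypothetical, chosen only for illustration; the abstract names the factors but does not state their levels or the design actually used in the dissertation.

```python
import itertools
import random

# Hypothetical factor levels (assumption): the abstract names dataset size,
# computational complexity, and hardware power, but gives no levels.
FACTORS = {
    "dataset_size_gb": [1, 10, 100],
    "routine_complexity": ["linear", "n_log_n", "quadratic"],
    "cores": [8, 32],
}


def full_factorial(factors, replicates=1, seed=0):
    """Return every combination of factor levels, replicated, in random run order."""
    names = list(factors)
    runs = [
        dict(zip(names, levels), replicate=r)
        for levels in itertools.product(*(factors[n] for n in names))
        for r in range(replicates)
    ]
    # Randomizing the run order guards against drift in the system under test
    # being confounded with a factor.
    random.Random(seed).shuffle(runs)
    return runs


design = full_factorial(FACTORS, replicates=2)
for run in design[:3]:
    print(run)
```

Each entry in `design` describes one timing run; the measured elapsed times could then be modeled as a function of the factors.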
In the second part of the dissertation, I will analyze a 1 TB Spamhaus blacklist dataset by applying Divide and Recombine on Hadoop. The Spamhaus data were collected from the Stanford mirror of the Spamhaus Internet IP address and domain name blacklist site. The Spamhaus service classifies IP addresses and domain names as blacklisted or not based on many sources of information and many factors, such as being a major conduit for spam. Queries are sent to the site about the status of an IP address or domain name: whether it is blacklisted and, if so, the cause. The processed data consist of values of 13 variables for each of 13,178,080,366 queries over 8 months. Subject-matter divisions will be carried out and the blacklisting properties will be analyzed. The data were analyzed on Professor Cleveland's 10-node cluster, wsc, which has a little over 1 TB of memory and 200 cores. An important property of the blacklisting and its cause were discovered.
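The Divide and Recombine pattern can be sketched in a few lines of Python: divide the records into subsets by a subject-matter variable, apply an analytic method to each subset independently, and recombine the per-subset outputs. This is a single-process toy under stated assumptions, not the Hadoop implementation used in the dissertation; the records and the field names `month` and `blacklisted` are hypothetical and are not claimed to be among the 13 variables of the real data.

```python
from collections import defaultdict

# Hypothetical query records (assumption): illustrative fields only.
queries = [
    {"month": 1, "blacklisted": True},
    {"month": 1, "blacklisted": False},
    {"month": 2, "blacklisted": True},
    {"month": 2, "blacklisted": True},
    {"month": 3, "blacklisted": False},
]


def divide(records, key):
    """Divide: group records into subsets by the value of a conditioning variable."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return subsets


def blacklisted_fraction(subset):
    """Analytic method applied to one subset: fraction of queries that are blacklisted."""
    return sum(r["blacklisted"] for r in subset) / len(subset)


def recombine(per_subset):
    """Recombine: collect the per-subset outputs into one result ordered by subset key."""
    return dict(sorted(per_subset.items()))


subsets = divide(queries, "month")
result = recombine({k: blacklisted_fraction(v) for k, v in subsets.items()})
print(result)
```

Because each subset is processed independently, the per-subset step is the part that a cluster can run in parallel.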