Ensuring Network Designs Meet Performance Requirements under Failures
With the prevalence of web and cloud-based services, there is an ever-growing requirement on the underlying network infrastructure to ensure that business-critical traffic is continually serviced with acceptable performance. Networks must therefore meet their performance requirements even under failures. The global scale of cloud provider networks and their rapid evolution mean that failures are the norm in production networks today. Unplanned downtime can cost billions of dollars and have catastrophic consequences. This thesis is motivated by these challenges and aims to provide a principled approach to certifying network performance under failures. Network performance certification is complicated by both the variety of ways in which a network can fail and the rich ways in which a network can respond to failures. The key contributions of this thesis are: (i) a general framework for robustly certifying the worst-case performance of a network across a given set of uncertain scenarios, whose key novelty is that it models the flexible network response enabled by emerging trends such as Software-Defined Networking; (ii) a toolkit that automates the key steps needed in robust certification, making it suitable for use by a network architect, and that enables experimentation on a wide range of robust certification problems of practical interest; and (iii) Slice, a general framework that efficiently classifies failure scenarios based on whether network performance is acceptable in those scenarios, and that allows reasoning about performance requirements that must be met over a given percentage of scenarios. We also show applications of our frameworks in synthesizing designs that are guaranteed to meet a performance goal over all, or a desired percentage, of a given set of scenarios. The thesis focuses on wide-area networks, but the approaches apply to data-center networks as well.