TOWARDS AN UNDERSTANDING OF RESIDUAL NETWORKS USING NEURAL TANGENT HIERARCHY
thesisposted on 06.05.2021, 14:49 by Yuqing Li
Deep learning has become an important toolkit for data science and artificial intelligence. In contrast to its practical success across a wide range of fields, theoretical understanding of the principles behind the success of deep learning has been an issue of controversy. Optimization, as an important component of theoretical machine learning, has attracted much attention. The optimization problems induced from deep learning is often non-convex and
non-smooth, which is challenging to locate the global optima. However, in practice, global convergence of first-order methods like gradient descent can be guaranteed for deep neural networks. In particular, gradient descent yields zero training loss in polynomial time for deep neural networks despite its non-convex nature. Besides that, another mysterious phenomenon is the compelling performance of Deep Residual Network (ResNet). Not only
does training ResNet require weaker conditions, the employment of residual connections by ResNet even enables first-order methods to train the neural networks with an order of magnitude more layers. Advantages arising from the usage of residual connections remain to be discovered.
In this thesis, we demystify these two phenomena accordingly. Firstly, we contribute to further understanding of gradient descent. The core of our analysis is the neural tangent hierarchy (NTH) that captures the gradient descent dynamics of deep neural networks. A recent work introduced the Neural Tangent Kernel (NTK) and proved that the limiting
NTK describes the asymptotic behavior of neural networks trained by gradient descent in the infinite width limit. The NTH outperforms the NTK in two ways: (i) It can directly study the time variation of NTK for neural networks. (ii) It improves the result to non-asymptotic settings. Moreover, by applying NTH to ResNet with smooth and Lipschitz activation function, we reduce the requirement on the layer width m with respect to the number of training samples n from quartic to cubic, obtaining a state-of-the-art result. Secondly, we extend our scope of analysis to structural properties of deep neural networks. By making fair and consistent comparisons between fully-connected network and ResNet, we suggest strongly that the particular skip-connection architecture possessed by ResNet is the main
reason for its triumph over fully-connected network.